Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Below, you will find my analysis for the Titanic challenge created by the Kaggle team. The competition started on September 28, 2012 and runs until December 31, 2016.

To create my own solution for this challenge, I studied the Titanic and its survivors.

The Titanic Decks

The Titanic Boat Deck plan with the lifeboats location

Titanic Class System

Some information can be found on websites such as https://www.encyclopedia-titanica.org/ or http://www.titanicfacts.net/titanic-passengers.html.

After spending some time understanding the situation from which the data were obtained, I started digging into the available datasets.

Install Packages

First, I installed all the R packages necessary for the analysis. The packages can be loaded along the way.

#install.packages("htmlwidgets")
#install_github("easyGgplot2", "kassambara")
#install.packages("devtools")
#library(htmlwidgets)
library('ggplot2') 
library('ggthemes') 
library('scales') 
library('dplyr') 
library('mice') 
library('randomForest') 
library('Hmisc')
library('reshape2') 
library('stringr')
library('plyr') 
library('gridExtra')
library('devtools')
library('easyGgplot2')
library('vcd')
library('rpart')
library('rattle')
library('rpart.plot')
library('RColorBrewer')
library('caret') 

Reading Datasets

Creating functions during the analysis is always important, especially when you have to deal with many repeated actions.

To facilitate reading the datasets, I used a function available on the Internet.

readData <- function(fileName, VariableType, missingNA) {
        read.csv2(fileName, sep=",",dec = ".",
                  colClasses=VariableType,
                  na.strings=missingNA)
}

train.data <- "train.csv"
test.data <- "test.csv"
missingNA <- c("NA", "")
train.VariableType <- c('integer',   # PassengerId
                        'numeric',   # Survived 
                        'factor',    # Pclass
                        'character', # Name
                        'factor',    # Sex
                        'numeric',   # Age
                        'integer',   # SibSp
                        'integer',   # Parch
                        'character', # Ticket
                        'numeric',   # Fare
                        'character', # Cabin
                        'factor'     # Embarked
)

test.VariableType <- train.VariableType[-2]     ## There is no "Survived" variable in the test file

dt.train <- readData(train.data, train.VariableType, missingNA)
dt.test <- readData(test.data,test.VariableType, missingNA)

Titanic train Dataset

The first step in working with machine learning is to evaluate the training dataset. For this step, I summarize it.

summary(dt.train)
##   PassengerId       Survived      Pclass      Name               Sex     
##  Min.   :  1.0   Min.   :0.0000   1:216   Length:891         female:314  
##  1st Qu.:223.5   1st Qu.:0.0000   2:184   Class :character   male  :577  
##  Median :446.0   Median :0.0000   3:491   Mode  :character               
##  Mean   :446.0   Mean   :0.3838                                          
##  3rd Qu.:668.5   3rd Qu.:1.0000                                          
##  Max.   :891.0   Max.   :1.0000                                          
##                                                                          
##       Age            SibSp           Parch           Ticket         
##  Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
##  1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
##  Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
##  Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
##  3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
##  Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
##  NA's   :177                                                        
##       Fare           Cabin           Embarked  
##  Min.   :  0.00   Length:891         C   :168  
##  1st Qu.:  7.91   Class :character   Q   : 77  
##  Median : 14.45   Mode  :character   S   :644  
##  Mean   : 32.20                      NA's:  2  
##  3rd Qu.: 31.00                                
##  Max.   :512.33                                
## 

From the summary table, we can see that the variables Age and Embarked have missing data; we will deal with them later.
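Besides summary(), the missing values per column can be counted directly; a minimal sketch on a hypothetical two-column data frame (not the real dataset):

```r
# Toy data frame standing in for dt.train; values are illustrative only.
toy <- data.frame(Age      = c(22, NA, 30, NA),
                  Embarked = c("S", "C", NA, "S"))

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
colSums(is.na(toy))
#      Age Embarked
#        2        1
```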

Here, we labeled the categorical variables to make them easier to read.

dt.train$Survived <- factor(dt.train$Survived, levels=c(1,0))
levels(dt.train$Survived) <- c("Survived", "Died")

dt.train$Pclass <- as.factor(dt.train$Pclass)
levels(dt.train$Pclass) <- c("1st Class", "2nd Class", "3rd Class")

dt.train$Sex <- factor(dt.train$Sex, levels=c("female", "male"))
levels(dt.train$Sex) <- c("Female", "Male")
mosaicplot(Pclass ~ Sex,
           data=dt.train, main="Titanic Training Data Passenger Survival by Class",
           color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
           off=c(0), cex.axis=1.4)

table(dt.train$Pclass,dt.train$Sex)
##            
##             Female Male
##   1st Class     94  122
##   2nd Class     76  108
##   3rd Class    144  347
round(prop.table(table(dt.train$Pclass,dt.train$Sex),1),3)
##            
##             Female  Male
##   1st Class  0.435 0.565
##   2nd Class  0.413 0.587
##   3rd Class  0.293 0.707

Analysing the figure and the tables above, it is clear that there were more men than women in the Titanic training dataset, especially in the third class.

mosaicplot(Sex ~ Survived, 
           data=dt.train,
           color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
           off=c(0), cex.axis=1.4,
           main="Titanic Training Data\nPassenger Survival by Sex")

table(dt.train$Sex,dt.train$Survived)
##         
##          Survived Died
##   Female      233   81
##   Male        109  468
round(prop.table(table(dt.train$Sex,dt.train$Survived),1),3)
##         
##          Survived  Died
##   Female    0.742 0.258
##   Male      0.189 0.811
mosaicplot(Pclass ~ Survived,
        data=dt.train,
        color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
        off=c(0), cex.axis=1.4, 
        main="Titanic Training Data\nPassenger Survival by Class")

table(dt.train$Pclass,dt.train$Survived)
##            
##             Survived Died
##   1st Class      136   80
##   2nd Class       87   97
##   3rd Class      119  372
round(prop.table(table(dt.train$Pclass,dt.train$Survived),1),3)
##            
##             Survived  Died
##   1st Class    0.630 0.370
##   2nd Class    0.473 0.527
##   3rd Class    0.242 0.758

The graphs and tables above show that the proportion of survivors is higher for females (74% vs 19% for males) and for first-class passengers (63%), followed by the second (47%) and third classes (24%).

h<-ggplot(dt.train,aes(x = Pclass, fill = Survived,y = (..count..))) +
        geom_bar() + labs(y = "Count")+
        labs(title="Titanic Training Data: Survived by Class")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

p<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x = Pclass, fill = Survived,y = (..count..))) +
        geom_bar() + labs(y = "Count")+
        labs(title="Female by Class")
p1<-p + scale_y_continuous(limits = c(0, 350))
p2<-p1+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p3<-p2+scale_color_manual(values=c("#8dd3c7","#fb8072"))

q<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x = Pclass, fill = Survived,y = (..count..))) +
        geom_bar() + labs(y = "Count")+
        labs(title="Male by Class")
q1<-q + scale_y_continuous(limits = c(0, 350))
q2<-q1+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q3<-q2+scale_color_manual(values=c("#8dd3c7","#fb8072"))

grid.arrange(h2, ncol=1, nrow =1)

grid.arrange(p2, q2, ncol=1, nrow =2)

Now, let’s analyse the survivors for each class and gender. For females, we observe only a few deaths in the first and second classes, with most of them happening in the third class (almost 50%). For males, we see a higher proportion of survivors in the first class, but there does not seem to be a clear pattern, as the worst survival proportion is in the second class.

mosaicplot(SibSp ~ Survived, 
           data=dt.train,
           color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
           off=c(0), cex.axis=1.4,
           main="Titanic Training Data\nPassenger Survival by the Number of Siblings/Spouses Aboard")

table(dt.train$SibSp,dt.train$Survived)
##    
##     Survived Died
##   0      210  398
##   1      112   97
##   2       13   15
##   3        4   12
##   4        3   15
##   5        0    5
##   8        0    7
round(prop.table(table(dt.train$SibSp,dt.train$Survived),1),3)
##    
##     Survived  Died
##   0    0.345 0.655
##   1    0.536 0.464
##   2    0.464 0.536
##   3    0.250 0.750
##   4    0.167 0.833
##   5    0.000 1.000
##   8    0.000 1.000

Family Members

The following is about the number of siblings and spouses, and parents and children, aboard the Titanic.

We can interpret that a person accompanied by one or two family members seems to have had a higher chance of surviving.

h<-ggplot(dt.train,aes(x=SibSp, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=10)  +
        labs(title="Titanic Training Data: \nNumber of Siblings/Spouses Aboard by Variable Survived")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

q<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x=SibSp, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=10)  +
        labs(title="Number of Siblings/Spouses Aboard for Female")
q1<-q+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

p<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x=SibSp, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=10) +
        labs(title="Number of Siblings/Spouses Aboard for Male")
p1<-p+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p2<-p1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

grid.arrange(h2, ncol=1, nrow =1)

grid.arrange(q2, p2, ncol=1, nrow =2)

mosaicplot(Parch ~ Survived, 
           data=dt.train,
           color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
           off=c(0), cex.axis=1.4,
           main="Titanic Training Data\nPassenger Survival by the Number of Parents/Children Aboard")

table(dt.train$Parch,dt.train$Survived)
##    
##     Survived Died
##   0      233  445
##   1       65   53
##   2       40   40
##   3        3    2
##   4        0    4
##   5        1    4
##   6        0    1
round(prop.table(table(dt.train$Parch,dt.train$Survived),1),3)
##    
##     Survived  Died
##   0    0.344 0.656
##   1    0.551 0.449
##   2    0.500 0.500
##   3    0.600 0.400
##   4    0.000 1.000
##   5    0.200 0.800
##   6    0.000 1.000
h<-ggplot(dt.train,aes(x=Parch, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=10)  +
        labs(title="Titanic Training Data: \nNumber of Parents/Children Aboard by Variable Survived")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

q<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x=Parch, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=10)  +
        labs(title="Number of Parents/Children Aboard for Female")
q1<-q+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

p<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x=Parch, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=10) +
        labs(title="Number of Parents/Children Aboard for Male")
p1<-p+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p2<-p1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

grid.arrange(h2, ncol=1, nrow =1)

grid.arrange(q2, p2, ncol=1, nrow =2)

Most of the passengers embarked at the Southampton port, followed by Cherbourg and Queenstown, respectively. Even so, the highest proportion of survivors came from Cherbourg.

dt.train$Embarked[which(is.na(dt.train$Embarked))] <- 'S' # The most common value

mosaicplot(Embarked ~ Survived, 
           data=dt.train,
           color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
           off=c(0), cex.axis=1.4,
           main="Titanic Training Data\nPassenger Survival by Port of Embarkation")

table(dt.train$Embarked,dt.train$Survived)
##    
##     Survived Died
##   C       93   75
##   Q       30   47
##   S      219  427
round(prop.table(table(dt.train$Embarked,dt.train$Survived),1),3)
##    
##     Survived  Died
##   C    0.554 0.446
##   Q    0.390 0.610
##   S    0.339 0.661

There are several methods of imputation. For the age, I chose to use the median based on the title, which I understand to have some relationship with the age of the individual.

The variable Title can be extracted from each passenger's name.

dt.train$Title <- gsub('(.*, )|(\\..*)', '', dt.train$Name)
table(dt.train$Title)
## 
##         Capt          Col          Don           Dr     Jonkheer 
##            1            2            1            7            1 
##         Lady        Major       Master         Miss         Mlle 
##            1            2           40          182            2 
##          Mme           Mr          Mrs           Ms          Rev 
##            1          517          125            1            6 
##          Sir the Countess 
##            1            1
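The gsub() pattern above drops everything up to the comma-and-space and everything from the first period onwards, leaving only the title. Checking it on a single hypothetical name in the same "Surname, Title. Given names" format:

```r
# Hypothetical passenger name, illustrative only.
name <- "Braund, Mr. Owen Harris"

# '(.*, )' removes the surname and comma; '(\\..*)' removes the dot and the rest.
title <- gsub('(.*, )|(\\..*)', '', name)
title
# [1] "Mr"
```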
options(digits=2)
with(dt.train,bystats(Age, Title, 
        fun=function(x)c(Mean=mean(x),Median=median(x))))
## 
##  c(6, 13, 6, 55, 13, 55, 6, 6) of Age by Title 
## 
##                N Missing Mean Median
## Capt           1       0 70.0   70.0
## Col            2       0 58.0   58.0
## Don            1       0 40.0   40.0
## Dr             6       1 42.0   46.5
## Jonkheer       1       0 38.0   38.0
## Lady           1       0 48.0   48.0
## Major          2       0 48.5   48.5
## Master        36       4  4.6    3.5
## Miss         146      36 21.8   21.0
## Mlle           2       0 24.0   24.0
## Mme            1       0 24.0   24.0
## Mr           398     119 32.4   30.0
## Mrs          108      17 35.9   35.0
## Ms             1       0 28.0   28.0
## Rev            6       0 43.2   46.5
## Sir            1       0 49.0   49.0
## the Countess   1       0 33.0   33.0
## ALL          714     177 29.7   28.0

I found the following imputeMedian function on the Internet. It is described as a function that receives the variable with missing values (VarImpute), the variable used as a filter (VarFilter) for the median imputation, and the levels of the filter variable (VarLevels).

imputeMedian <- function(VarImpute, VarFilter, VarLevels) {
        for (i in VarLevels) {
                VarImpute[ which(VarFilter == i)] <- impute(VarImpute[ 
                        which( VarFilter == i)])
        }
        return (VarImpute)
}
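The impute() call inside the loop comes from the Hmisc package and, with no extra arguments, fills NAs with the median of the non-missing values. A dependency-free sketch of the same idea on toy data (column names and values are illustrative):

```r
# Toy data: two title groups, each with one missing Age.
toy <- data.frame(Title = c("Mr", "Mr", "Mr", "Miss", "Miss"),
                  Age   = c(30, NA, 40, 22, NA))

# Same loop shape as imputeMedian(), but using base R's median() directly.
imputeMedianBase <- function(VarImpute, VarFilter, VarLevels) {
        for (i in VarLevels) {
                idx <- which(VarFilter == i)
                med <- median(VarImpute[idx], na.rm = TRUE)
                VarImpute[idx][is.na(VarImpute[idx])] <- med
        }
        VarImpute
}

toy$Age <- imputeMedianBase(toy$Age, toy$Title, unique(toy$Title))
toy$Age
# [1] 30 35 40 22 22
```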
unique(dt.train$Title)
##  [1] "Mr"           "Mrs"          "Miss"         "Master"      
##  [5] "Don"          "Rev"          "Dr"           "Mme"         
##  [9] "Ms"           "Major"        "Lady"         "Sir"         
## [13] "Mlle"         "Col"          "Capt"         "the Countess"
## [17] "Jonkheer"
## list of all titles 
titles <- c("Mr","Mrs","Miss","Master","Don","Rev",
                     "Dr","Mme","Ms","Major","Lady","Sir",
                     "Mlle","Col","Capt","the Countess","Jonkheer","Dona")

dt.train$Age[which(dt.train$Title=="Dr")]
## [1] 44 54 23 32 50 NA 49
dt.train$Age <- imputeMedian(dt.train$Age,dt.train$Title,titles)
dt.train$Age[which(dt.train$Title=="Dr")] #Checking imputation
## [1] 44 54 23 32 50 46 49
h<-ggplot(dt.train,aes(x=Age, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Training Data: Age by Variable Survived")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))


q<-ggplot(subset(dt.train, dt.train$Sex=="Female"),aes(x=Age, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Age of Female")
q1<-q+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))

p<-ggplot(subset(dt.train, dt.train$Sex=="Male"),aes(x=Age, fill=Survived, color=Survived)) +
        geom_histogram(position="identity", alpha=0.5,bins=90) +
        labs(title="Age of Male")
p1<-p+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
p2<-p1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)

grid.arrange(q2, p2, ncol=1, nrow =2)

Evaluating the distribution of survivors per age by gender, apparently age was not important if the individual was female. For males, the highest proportion of survivors occurred for passengers under the age of 15.

The following histogram presents the distribution of age per class.

q<-ggplot(dt.train, aes(x=Age, fill=Pclass)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Training Data: Age by Class")
q1<-q+scale_fill_manual(name="Class",values=c("blue","green", "red"))
q2<-q1+scale_color_manual(values=c("blue","green", "red"))
q2

Because the variable Title has too many categories, we are going to create a new title according to the following rules.

dt.train$NewTitle[dt.train$Title %in% c("Capt","Col","Don", "Dr", "Major","Rev")] <- 0 # Note: one of the Drs is a woman
dt.train$NewTitle[dt.train$Title %in% c("Lady", "Mme", "Mrs", "Ms", "the Countess")] <- 1
dt.train$NewTitle[dt.train$Title %in% c("Master")] <- 2
dt.train$NewTitle[dt.train$Title %in% c("Miss", "Mlle")] <- 3
dt.train$NewTitle[dt.train$Title %in% c("Mr", "Sir", "Jonkheer")] <- 4
dt.train$NewTitle <- as.factor(dt.train$NewTitle)
levels(dt.train$NewTitle) <- c("Special", "Mrs", "Master","Miss","Mr")

table(dt.train$NewTitle, dt.train$Survived)
##          
##           Survived Died
##   Special        5   14
##   Mrs          103   26
##   Master        23   17
##   Miss         129   55
##   Mr            82  437
round(prop.table(table(dt.train$NewTitle, dt.train$Survived),1),3)
##          
##           Survived Died
##   Special     0.26 0.74
##   Mrs         0.80 0.20
##   Master      0.57 0.42
##   Miss        0.70 0.30
##   Mr          0.16 0.84

The tables above suggest that the new title follows the “women and children first” code, where Master (boys under the age of 13), Miss, and Mrs have the highest survival rates.

With this noted, I am going to create new variables that separate children (independent of gender), adult women, and adult men, using two alternative age cut-offs for children: under 13 and under 15.

Code: Women and Children First

Based on the fact that during a disaster the priority is women and children first, we are going to create a variable that separates children, women, and men.

dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Master")] <- 0
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age<=12] <- 0
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age>12] <- 1
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Mrs")] <- 1
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Female"] <- 1 #For example for a Dr Woman
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Male"] <- 2 
dt.train$WomanChild12_1st[dt.train$NewTitle %in% c("Mr")] <- 2
dt.train$WomanChild12_1st <- as.factor(dt.train$WomanChild12_1st)
levels(dt.train$WomanChild12_1st) <- c("Children", "Women", "Men")

table(dt.train$WomanChild12_1st, dt.train$Survived)
##           
##            Survived Died
##   Children       42   30
##   Women         214   68
##   Men            86  451
round(prop.table(table(dt.train$WomanChild12_1st, dt.train$Survived),1),3)
##           
##            Survived Died
##   Children     0.58 0.42
##   Women        0.76 0.24
##   Men          0.16 0.84
table(dt.train$WomanChild12_1st, dt.train$NewTitle)
##           
##            Special Mrs Master Miss  Mr
##   Children       0   0     40   32   0
##   Women          1 129      0  152   0
##   Men           18   0      0    0 519
round(prop.table(table(dt.train$WomanChild12_1st, dt.train$NewTitle),1),3)
##           
##            Special   Mrs Master  Miss    Mr
##   Children   0.000 0.000  0.556 0.444 0.000
##   Women      0.004 0.457  0.000 0.539 0.000
##   Men        0.034 0.000  0.000 0.000 0.966
h<-ggplot(dt.train,aes(x = WomanChild12_1st, fill = Survived,y = (..count..))) +
        geom_bar() + labs(y = "Count")+
        labs(title="Titanic Training Data: Women and Children 1st Survival",x="")
h1<-h+scale_fill_manual(name="Women & Children (< 13 years)\nFirst",values=c("#8dd3c7","#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Master")] <-0
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age<=14] <- 0
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Miss") & dt.train$Age>14] <- 1
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Mrs")] <- 1
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Female"] <- 1 #For example for a Dr Woman
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Special") & dt.train$Sex=="Male"] <- 2 
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Mr") & dt.train$Age<=14] <- 0
dt.train$WomanChild14_1st[dt.train$NewTitle %in% c("Mr") & dt.train$Age>14] <- 2
dt.train$WomanChild14_1st <- as.factor(dt.train$WomanChild14_1st)
levels(dt.train$WomanChild14_1st) <- c("Children", "Women", "Men")

table(dt.train$WomanChild14_1st, dt.train$Survived)
##           
##            Survived Died
##   Children       46   34
##   Women         210   67
##   Men            86  448
round(prop.table(table(dt.train$WomanChild14_1st, dt.train$Survived),1),3)
##           
##            Survived Died
##   Children     0.57 0.42
##   Women        0.76 0.24
##   Men          0.16 0.84
table(dt.train$WomanChild14_1st, dt.train$NewTitle)
##           
##            Special Mrs Master Miss  Mr
##   Children       0   0     40   37   3
##   Women          1 129      0  147   0
##   Men           18   0      0    0 516
round(prop.table(table(dt.train$WomanChild14_1st, dt.train$NewTitle),1),3)
##           
##            Special   Mrs Master  Miss    Mr
##   Children   0.000 0.000  0.500 0.462 0.038
##   Women      0.004 0.466  0.000 0.531 0.000
##   Men        0.034 0.000  0.000 0.000 0.966
q<-ggplot(dt.train,aes(x = WomanChild14_1st, fill = Survived,y = (..count..))) +
        geom_bar() + labs(y = "Count")+ 
        labs(title="Titanic Training Data: Survival of Women and Children First code",x="")
q1<-q+scale_fill_manual(name="Women & Children (< 15 years)\nFirst",values=c("#8dd3c7","#fb8072"))
q2<-q1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(q2 ,ncol=1, nrow =1)

p<-ggplot(dt.train, aes(x=Age, fill=WomanChild12_1st)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
                labs(title="Titanic Training Data: Survival of Women and Children First code")
p1<-p+scale_fill_manual(name="Women & Children (< 13 years)\nFirst",values=c("green","blue", "pink"))
p2<-p1+scale_color_manual(values=c("green","blue", "pink"))

q<-ggplot(dt.train, aes(x=Age, fill=WomanChild14_1st)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Training Data: Survival of Women and Children First code")
q1<-q+scale_fill_manual(name="Women & Children (< 15 years)\nFirst",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
grid.arrange(p2,q2 ,ncol=1, nrow =2)

For the training dataset, there does not seem to be a difference between choosing a children’s age cut-off of 12 or 14 years old, but I will check whether there is any difference in the models.

We also observe that the proportion of survivors is highest for adult women, followed by children, and lowest for men.

As we believe that a passenger with a family had a better chance of surviving, we are going to evaluate whether the size of the family matters. For that, I will create a variable that counts the number of family members on the Titanic (combining the number of children, siblings, parents, and spouses).

dt.train$FamilySize <- dt.train$SibSp + dt.train$Parch + 1 # Passenger + Siblings/Spouses + Parents/Children aboard
boxplot(Age ~ FamilySize, data =dt.train, xlab="Family Size on the Ship", 
                ylab="Age (years)", main="Titanic Training Data")

q <- ggplot(dt.train, aes(x=FamilySize, y=Age)) + geom_jitter(aes(colour = Survived),width = 0.25) 
q1 <- q + xlab("Family Size") 
q2 <- q1 + ylab("Age (years)")
q2

From the previous graphs, we can see that people with larger families were younger and had a higher chance of dying.

It seems that passengers travelling alone or with 5 or more family members on the ship were more likely to die, while an individual in a family of 2, 3, or 4 was more likely to survive. Because of that, I am going to categorize the family size as follows.

dt.train$Fsize[dt.train$FamilySize == 1] <- 1
dt.train$Fsize[dt.train$FamilySize == 2] <- 2
dt.train$Fsize[dt.train$FamilySize == 3] <- 3
dt.train$Fsize[dt.train$FamilySize == 4] <- 4
dt.train$Fsize[dt.train$FamilySize >= 5] <- 5 
dt.train$Fsize <- as.factor(dt.train$Fsize)

levels(dt.train$Fsize) <- c("1", "2", "3","4","5+")
table(dt.train$Fsize, dt.train$Survived)
##     
##      Survived Died
##   1       163  374
##   2        89   72
##   3        59   43
##   4        21    8
##   5+       10   52
round(prop.table(table(dt.train$Fsize, dt.train$Survived),1),3)
##     
##      Survived Died
##   1      0.30 0.70
##   2      0.55 0.45
##   3      0.58 0.42
##   4      0.72 0.28
##   5+     0.16 0.84
with(dt.train,table(Fsize, Sex))
##      Sex
## Fsize Female Male
##    1     126  411
##    2      87   74
##    3      49   53
##    4      19   10
##    5+     33   29
round(prop.table(table(dt.train$Fsize, dt.train$Sex),1),3)
##     
##      Female Male
##   1    0.23 0.76
##   2    0.54 0.46
##   3    0.48 0.52
##   4    0.66 0.34
##   5+   0.53 0.47
h<-ggplot(dt.train,aes(x = Fsize, fill = Survived,y = (..count..))) +
        geom_bar() + labs(y = "Count")+
        labs(title="Titanic Training Data: Survived by Family Size on the Ship")
h1<-h+scale_fill_manual(values=c("#8dd3c7", "#fb8072"))
h2<-h1+scale_color_manual(values=c("#8dd3c7","#fb8072"))
grid.arrange(h2, ncol=1, nrow =1)

q <- ggplot(dt.train, aes(x=Fsize, y=Age)) + geom_jitter(aes(colour = Survived),width = 0.25) 
q1 <- q+ xlab("Family Size")
q2 <- q1 + ylab("Age (years)")
grid.arrange(q2, ncol=1, nrow =1)

Just out of curiosity, I created a variable that estimates the family size within the training or test dataset. I did that so my models could adjust to the size of the family in the dataset being evaluated.

First, I created the FamilyID by pasting the family size aboard the Titanic together with the passenger’s surname.

dt.train$FamilyName <- gsub(",.*$", "", dt.train$Name)
dt.train$FamilyID <- paste(as.character(dt.train$FamilySize), dt.train$FamilyName, sep="")

With the FamilyID, we can see that even though the Sage family reportedly had 11 family members aboard the Titanic, we have information on only 7 members, all of whom died. Maybe the other 4 members survived and are in the test dataset.

The following variable is meant to give a unique family identification to each passenger. For this analysis, we will assume that all members of a family have the same number of family members on the Titanic, the same surname, the same embarkation port, and the same ticket number.

Passengers with different ticket numbers or who embarked at different ports won’t be classified as the same family.

dt.train$FamilyID_Embk_Ticket <- paste(dt.train$FamilyID,dt.train$Embarked, as.character(dt.train$Ticket), sep="_")
dt.train$FamilyID_dataSet <- match(dt.train$FamilyID_Embk_Ticket, unique(dt.train$FamilyID_Embk_Ticket))
dt.train$FamilySize_dataSet <- ave(dt.train$FamilyID_dataSet,dt.train$FamilyID_dataSet, FUN =length)
summary(dt.train$FamilySize_dataSet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.6     2.0     7.0
table(dt.train$FamilySize_dataSet,dt.train$FamilySize)
##    
##       1   2   3   4   5   6   7   8  11
##   1 533  81  33   4   1   1   1   0   0
##   2   4  80  42   8   2   0   0   0   0
##   3   0   0  27   9   0   0   0   0   0
##   4   0   0   0   8  12   4   4   0   0
##   5   0   0   0   0   0   5   0   0   0
##   6   0   0   0   0   0  12   0   6   0
##   7   0   0   0   0   0   0   7   0   7
plot(dt.train$FamilySize_dataSet,dt.train$FamilySize, xlab="Family Size in the dataset",
     ylab="Family Size on the Ship",main= "Titanic Training dataset")

As we were expecting, the family-size-in-the-dataset variable worked well, assuming values equal to or less than the family size on the Titanic.
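The match()/ave() pair used above is a compact way to count how many rows share the same identifier; a minimal sketch on hypothetical family IDs (not real passengers):

```r
# Hypothetical FamilyID_Embk_Ticket values: two Smiths sharing a ticket, one Jones.
ids <- c("2Smith_S_T1", "2Smith_S_T1", "1Jones_C_T2")

# match() maps each ID to a group number; ave(..., FUN = length) counts group sizes.
grp  <- match(ids, unique(ids))
size <- ave(grp, grp, FUN = length)
grp   # 1 1 2
size  # 2 2 1
```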

with(dt.train,bystats(Fare, Pclass, 
        fun=function(x)c(Mean=mean(x),Median=median(x))))
## 
##  c(2, 13, 2, 55, 13, 55, 2, 2) of Fare by Pclass 
## 
##             N Mean Median
## 1st Class 216   84   60.3
## 2nd Class 184   21   14.2
## 3rd Class 491   14    8.1
## ALL       891   32   14.5
q<-ggplot(dt.train, aes(x=Fare, fill=Pclass)) +
        geom_histogram(position="identity", alpha=0.5,bins=50)  +
        labs(title="Titanic Training Data: Fare by Class")
q1<-q+scale_fill_manual(name="Class",values=c("green","blue", "red"))
q2<-q1+scale_color_manual(values=c("green","blue", "red"))
grid.arrange(q2, ncol=1, nrow =1)

Checking the ticket price (fare) by class, the median fare clearly decreases from first to third class.

with(dt.train,bystats(Fare, FamilySize, 
        fun=function(x)c(Mean=mean(x),Median=median(x))))
## 
##  c(2, 13, 2, 55, 13, 55, 2, 2) of Fare by FamilySize 
## 
##       N Mean Median
## 1   537   21    8.1
## 2   161   50   26.0
## 3   102   40   24.1
## 4    29   55   27.8
## 5    15   58   25.5
## 6    22   74   29.1
## 7    12   29   31.3
## 8     6   47   46.9
## 11    7   70   69.5
## ALL 891   32   14.5
with(dt.train, {
        boxplot(Fare ~ FamilySize, xlab="Family Size on the Titanic", 
                ylab="Fare", main="Titanic Training Data", col=2:10)
})

with(dt.train,bystats(Fare, Fsize, 
        fun=function(x)c(Mean=mean(x),Median=median(x))))
## 
##  c(10, 13, 10, 55, 13, 55, 10, 10) of Fare by Fsize 
## 
##       N Mean Median
## 1   537   21    8.1
## 2   161   50   26.0
## 3   102   40   24.1
## 4    29   55   27.8
## 5+   62   58   31.4
## ALL 891   32   14.5
with(dt.train, {
        boxplot(Fare ~ Fsize, xlab="Family Size on the Titanic", 
                ylab="Fare", main="Titanic Training Data", col=2:10)
        })

From the boxplots above, the fare differs noticeably only for passengers travelling alone, who show the lowest median fare.

Models

To submit predictions to the Kaggle competition, I chose the models with the highest accuracy on the training dataset.
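The selection rule behind Table 1 can be sketched as follows. The accuracy values here are a few illustrative rows, not the full table:

```r
# Hypothetical accuracies for the three model families (placeholder values
# taken from the first rows of Table 1).
acc <- data.frame(
  model    = 1:4,
  logistic = c(0.787, 0.787, 0.796, 0.790),
  dtree    = c(0.787, 0.792, 0.820, 0.835),
  rforest  = c(0.787, 0.798, 0.818, 0.857)
)
# For each family, pick the model index with the highest training accuracy.
sapply(acc[-1], function(a) acc$model[which.max(a)])
# logistic    dtree  rforest
#        3        4        4
```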

Table 1: Accuracy of the adjusted models on the training dataset.

Model   Logistic   Decision Tree   Random Forest
 1      0.787      0.787           0.787
 2      0.787      0.792           0.798
 3      0.796      0.820           0.818
 4      0.790      0.835           0.857
 5      0.793      0.835           0.850
 6      0.804      0.835           0.869
 7      0.796      0.840           0.848
 8      0.806      0.840           0.864
 9      0.826      0.841           0.850
10      0.819      0.841           0.868
11      0.792      0.832           0.846
12      0.810      0.835           0.864
13      0.833      0.834           0.852
14      0.831      0.834           0.861
15      0.834      0.835           0.844
16      0.827      0.835           0.860
17      0.832      0.834           0.844
18      0.832      0.834           0.834
19      0.833      0.835           0.834
20      0.833

From the table above, I selected the following models:

    * Model 9: Decision Tree (Survived ~ Sex + Age + Pclass + Fsize)
            ** (accuracy in the train dataset = 0.841, Kaggle score = 0.79426)

    * Model 6: Random Forest (Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked)
            ** (accuracy in the train dataset = 0.869, Kaggle score = 0.79426)

    * Model 19: Logistic (Survived ~ Pclass + Fsize + WomanChild12_1st)
            ** (accuracy in the train dataset = 0.833, Kaggle score = 0.78947)

    * Model 13: Logistic (Survived ~ Sex + Age + Pclass + Fsize + NewTitle)
            ** (accuracy in the train dataset = 0.833, Kaggle score = 0.78469)

    * Model 20: Logistic with Stepwise
            ** (Survived ~ Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st)
            ** (accuracy in the train dataset = 0.833, Kaggle score = 0.78469)
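Each selected model was submitted to Kaggle as a two-column CSV (PassengerId, Survived). A minimal sketch of that step, using toy stand-ins for the fitted model's predictions and the test-set ids (the real predictions come from the fitted models in the Appendix):

```r
# Toy stand-ins: three test passengers and their predicted classes.
dt.test <- data.frame(PassengerId = 892:894)
pred <- factor(c("Survived", "Died", "Survived"))

# Kaggle expects Survived coded as 1/0.
submission <- data.frame(
  PassengerId = dt.test$PassengerId,
  Survived    = ifelse(pred == "Survived", 1, 0)
)
write.csv(submission, "submission.csv", row.names = FALSE)
```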

Conclusions

I analyzed the Titanic data to predict who survived and who died based on selected variables. The best models used variables such as sex, age, class, family size, family size in the dataset, title, and the "women and children first" variable.

I am comfortable saying that gender, age, and class were the major factors determining survival or death in the Titanic tragedy. Other factors, such as title, family size, and port of embarkation, also played a role.

The best models, according to the Kaggle score, are Model 9 (decision tree) and Model 6 (random forest), which landed me in the top 28%. On the training dataset, the accuracy of the random forest is higher than that of the decision tree.

For future work, it would be interesting to evaluate the distance between each passenger's cabin and the lifeboats. Such a variable could be created by using the ticket number to identify the cabin's position on the ship and computing a vector distance. This is interesting because, in a moment of desperation, people closer to the lifeboats may have had a higher chance of surviving. From the figures at the beginning of this work, we can see that the first- and second-class areas were closer to the lifeboats.
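That cabin-distance idea can be prototyped from the Cabin field: its first letter encodes the deck, with A nearest the Boat Deck and G lowest. A minimal sketch, assuming this vertical ordering as a crude proxy for distance to the lifeboats (a real feature would also need the fore/aft position):

```r
# Extract the deck letter from the cabin code and map it to a vertical
# distance rank (A = closest to the Boat Deck, G = farthest; NA if unknown).
cabins <- c("C85", "E46", "G6", NA)
deck <- substr(cabins, 1, 1)
deck_rank <- match(deck, LETTERS[1:7])
deck_rank  # 3 5 7 NA
```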

My Kaggle score in the Titanic Challenge

APPENDIX

set.seed(12345)
library(caret)    # confusionMatrix()
library(rpart)    # decision trees
library(rattle)   # fancyRpartPlot()

Model 1: Survived ~ Sex

Logistic (Accuracy : 0.787)

fit1.log <- glm(Survived ~ Sex , family = binomial(link='logit'), data = dt.train)
summary(fit1.log)
## 
## Call:
## glm(formula = Survived ~ Sex, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.826  -0.772   0.647   0.647   1.646  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.057      0.129   -8.19  2.6e-16 ***
## SexMale        2.514      0.167   15.04  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  917.8  on 889  degrees of freedom
## AIC: 921.8
## 
## Number of Fisher Scoring iterations: 4
dt.train$pred.fit1.log <- predict.glm(fit1.log, newdata = dt.train, type = "response")
dt.train$pred.fit1.log <- ifelse(dt.train$pred.fit1.log > 0.5,1,0)
dt.train$pred.fit1.log <- factor(dt.train$pred.fit1.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit1.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      233   81
##   Died          109  468
##                                         
##                Accuracy : 0.787         
##                  95% CI : (0.758, 0.813)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.542         
##  Mcnemar's Test P-Value : 0.0501        
##                                         
##             Sensitivity : 0.681         
##             Specificity : 0.852         
##          Pos Pred Value : 0.742         
##          Neg Pred Value : 0.811         
##              Prevalence : 0.384         
##          Detection Rate : 0.262         
##    Detection Prevalence : 0.352         
##       Balanced Accuracy : 0.767         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree ( Accuracy : 0.787)

fit1.dt <- rpart(Survived ~ Sex, data=dt.train, method="class")
fancyRpartPlot(fit1.dt)

dt.train$pred.fit1.dt <- predict(fit1.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit1.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      233   81
##   Died          109  468
##                                         
##                Accuracy : 0.787         
##                  95% CI : (0.758, 0.813)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.542         
##  Mcnemar's Test P-Value : 0.0501        
##                                         
##             Sensitivity : 0.681         
##             Specificity : 0.852         
##          Pos Pred Value : 0.742         
##          Neg Pred Value : 0.811         
##              Prevalence : 0.384         
##          Detection Rate : 0.262         
##    Detection Prevalence : 0.352         
##       Balanced Accuracy : 0.767         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest ( Accuracy : 0.787)

fit1.rf <- randomForest(Survived ~ Sex,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit1.rf <- predict(fit1.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit1.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      233   81
##   Died          109  468
##                                         
##                Accuracy : 0.787         
##                  95% CI : (0.758, 0.813)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.542         
##  Mcnemar's Test P-Value : 0.0501        
##                                         
##             Sensitivity : 0.681         
##             Specificity : 0.852         
##          Pos Pred Value : 0.742         
##          Neg Pred Value : 0.811         
##              Prevalence : 0.384         
##          Detection Rate : 0.262         
##    Detection Prevalence : 0.352         
##       Balanced Accuracy : 0.767         
##                                         
##        'Positive' Class : Survived      
## 

Model 2: Survived ~ Sex + Age

Logistic (Accuracy : 0.787)

fit2.log <- glm(Survived ~ Sex + Age , family = binomial(link='logit'), data = dt.train)
summary(fit2.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.933  -0.773   0.637   0.654   1.703  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.19213    0.21719   -5.49    4e-08 ***
## SexMale      2.50177    0.16769   14.92   <2e-16 ***
## Age          0.00489    0.00626    0.78     0.43    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  917.19  on 888  degrees of freedom
## AIC: 923.2
## 
## Number of Fisher Scoring iterations: 4
dt.train$pred.fit2.log <- predict.glm(fit2.log, newdata = dt.train, type = "response")
dt.train$pred.fit2.log <- ifelse(dt.train$pred.fit2.log > 0.5,1,0)
dt.train$pred.fit2.log <- factor(dt.train$pred.fit2.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit2.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      233   81
##   Died          109  468
##                                         
##                Accuracy : 0.787         
##                  95% CI : (0.758, 0.813)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.542         
##  Mcnemar's Test P-Value : 0.0501        
##                                         
##             Sensitivity : 0.681         
##             Specificity : 0.852         
##          Pos Pred Value : 0.742         
##          Neg Pred Value : 0.811         
##              Prevalence : 0.384         
##          Detection Rate : 0.262         
##    Detection Prevalence : 0.352         
##       Balanced Accuracy : 0.767         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree ( Accuracy : 0.792)

fit2.dt <- rpart(Survived ~ Sex + Age, data=dt.train, method="class")
fancyRpartPlot(fit2.dt)

dt.train$pred.fit2.dt <- predict(fit2.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit2.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      256   99
##   Died           86  450
##                                         
##                Accuracy : 0.792         
##                  95% CI : (0.764, 0.819)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.564         
##  Mcnemar's Test P-Value : 0.378         
##                                         
##             Sensitivity : 0.749         
##             Specificity : 0.820         
##          Pos Pred Value : 0.721         
##          Neg Pred Value : 0.840         
##              Prevalence : 0.384         
##          Detection Rate : 0.287         
##    Detection Prevalence : 0.398         
##       Balanced Accuracy : 0.784         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest ( Accuracy : 0.798)

fit2.rf <- randomForest(Survived ~ Sex + Age,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit2.rf <- predict(fit2.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit2.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      245   83
##   Died           97  466
##                                        
##                Accuracy : 0.798        
##                  95% CI : (0.77, 0.824)
##     No Information Rate : 0.616        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.57         
##  Mcnemar's Test P-Value : 0.333        
##                                        
##             Sensitivity : 0.716        
##             Specificity : 0.849        
##          Pos Pred Value : 0.747        
##          Neg Pred Value : 0.828        
##              Prevalence : 0.384        
##          Detection Rate : 0.275        
##    Detection Prevalence : 0.368        
##       Balanced Accuracy : 0.783        
##                                        
##        'Positive' Class : Survived     
## 
# Look at variable importance
varImpPlot(fit2.rf)

Model 3: Survived ~ Sex + Age + Pclass

Logistic (Accuracy : 0.796)

fit3.log <- glm(Survived ~ Sex + Age + Pclass, family = binomial(link='logit'), data = dt.train)
summary(fit3.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.452  -0.632   0.411   0.661   2.669  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.60274    0.36306   -9.92  < 2e-16 ***
## SexMale          2.58635    0.18665   13.86  < 2e-16 ***
## Age              0.03513    0.00734    4.78  1.7e-06 ***
## Pclass2nd Class  1.14469    0.25786    4.44  9.0e-06 ***
## Pclass3rd Class  2.39139    0.24501    9.76  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  802.66  on 886  degrees of freedom
## AIC: 812.7
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit3.log <- predict.glm(fit3.log, newdata = dt.train, type = "response")
dt.train$pred.fit3.log <- ifelse(dt.train$pred.fit3.log > 0.5,1,0)
dt.train$pred.fit3.log <- factor(dt.train$pred.fit3.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit3.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      236   76
##   Died          106  473
##                                         
##                Accuracy : 0.796         
##                  95% CI : (0.768, 0.822)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.561         
##  Mcnemar's Test P-Value : 0.0316        
##                                         
##             Sensitivity : 0.690         
##             Specificity : 0.862         
##          Pos Pred Value : 0.756         
##          Neg Pred Value : 0.817         
##              Prevalence : 0.384         
##          Detection Rate : 0.265         
##    Detection Prevalence : 0.350         
##       Balanced Accuracy : 0.776         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree ( Accuracy : 0.82 )

fit3.dt <- rpart(Survived ~ Sex + Age + Pclass, data=dt.train, method="class")
fancyRpartPlot(fit3.dt)

dt.train$pred.fit3.dt <- predict(fit3.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit3.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      244   62
##   Died           98  487
##                                         
##                Accuracy : 0.82          
##                  95% CI : (0.794, 0.845)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.613         
##  Mcnemar's Test P-Value : 0.00566       
##                                         
##             Sensitivity : 0.713         
##             Specificity : 0.887         
##          Pos Pred Value : 0.797         
##          Neg Pred Value : 0.832         
##              Prevalence : 0.384         
##          Detection Rate : 0.274         
##    Detection Prevalence : 0.343         
##       Balanced Accuracy : 0.800         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest ( Accuracy : 0.818)

fit3.rf <- randomForest(Survived ~ Sex + Age + Pclass,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit3.rf <- predict(fit3.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit3.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      239   55
##   Died          103  494
##                                         
##                Accuracy : 0.823         
##                  95% CI : (0.796, 0.847)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.615         
##  Mcnemar's Test P-Value : 0.000185      
##                                         
##             Sensitivity : 0.699         
##             Specificity : 0.900         
##          Pos Pred Value : 0.813         
##          Neg Pred Value : 0.827         
##              Prevalence : 0.384         
##          Detection Rate : 0.268         
##    Detection Prevalence : 0.330         
##       Balanced Accuracy : 0.799         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit3.rf)

Model 4: Survived ~ Sex + Age + Pclass + SibSp

Logistic (Accuracy : 0.79)

fit4.log <- glm(Survived ~ Sex + Age + Pclass + SibSp, family = binomial(link='logit'), data = dt.train)
summary(fit4.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + SibSp, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.485  -0.622   0.413   0.597   2.722  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.14231    0.40050  -10.34  < 2e-16 ***
## SexMale          2.71767    0.19423   13.99  < 2e-16 ***
## Age              0.04290    0.00783    5.48  4.2e-08 ***
## Pclass2nd Class  1.22563    0.26248    4.67  3.0e-06 ***
## Pclass3rd Class  2.42622    0.24720    9.81  < 2e-16 ***
## SibSp            0.37558    0.10354    3.63  0.00029 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  786.49  on 885  degrees of freedom
## AIC: 798.5
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit4.log <- predict.glm(fit4.log, newdata = dt.train, type = "response")
dt.train$pred.fit4.log <- ifelse(dt.train$pred.fit4.log > 0.5,1,0)
dt.train$pred.fit4.log <- factor(dt.train$pred.fit4.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit4.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      241   86
##   Died          101  463
##                                         
##                Accuracy : 0.79          
##                  95% CI : (0.762, 0.816)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.553         
##  Mcnemar's Test P-Value : 0.306         
##                                         
##             Sensitivity : 0.705         
##             Specificity : 0.843         
##          Pos Pred Value : 0.737         
##          Neg Pred Value : 0.821         
##              Prevalence : 0.384         
##          Detection Rate : 0.270         
##    Detection Prevalence : 0.367         
##       Balanced Accuracy : 0.774         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree ( Accuracy : 0.835)

fit4.dt <- rpart(Survived ~ Sex + Age + Pclass + SibSp, data=dt.train, method="class")
fancyRpartPlot(fit4.dt)

dt.train$pred.fit4.dt <- predict(fit4.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit4.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   56
##   Died           91  493
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.644         
##  Mcnemar's Test P-Value : 0.00504       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.898         
##          Pos Pred Value : 0.818         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.345         
##       Balanced Accuracy : 0.816         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest ( Accuracy : 0.857)

fit4.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit4.rf <- predict(fit4.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit4.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      253   40
##   Died           89  509
##                                        
##                Accuracy : 0.855        
##                  95% CI : (0.83, 0.878)
##     No Information Rate : 0.616        
##     P-Value [Acc > NIR] : < 2e-16      
##                                        
##                   Kappa : 0.685        
##  Mcnemar's Test P-Value : 2.38e-05     
##                                        
##             Sensitivity : 0.740        
##             Specificity : 0.927        
##          Pos Pred Value : 0.863        
##          Neg Pred Value : 0.851        
##              Prevalence : 0.384        
##          Detection Rate : 0.284        
##    Detection Prevalence : 0.329        
##       Balanced Accuracy : 0.833        
##                                        
##        'Positive' Class : Survived     
## 
# Look at variable importance
varImpPlot(fit4.rf)

Model 5: Survived ~ Sex + Age + Pclass + SibSp + Parch

Logistic (Accuracy : 0.793)

fit5.log <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch, family = binomial(link='logit'), data = dt.train)
summary(fit5.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + SibSp + Parch, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.464  -0.617   0.415   0.601   2.691  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.18334    0.40573  -10.31  < 2e-16 ***
## SexMale          2.74277    0.19857   13.81  < 2e-16 ***
## Age              0.04310    0.00784    5.50  3.9e-08 ***
## Pclass2nd Class  1.22574    0.26245    4.67  3.0e-06 ***
## Pclass3rd Class  2.42559    0.24703    9.82  < 2e-16 ***
## SibSp            0.35442    0.10811    3.28    0.001 ** 
## Parch            0.07396    0.11512    0.64    0.521    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  786.08  on 884  degrees of freedom
## AIC: 800.1
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit5.log <- predict.glm(fit5.log, newdata = dt.train, type = "response")
dt.train$pred.fit5.log <- ifelse(dt.train$pred.fit5.log > 0.5,1,0)
dt.train$pred.fit5.log <- factor(dt.train$pred.fit5.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit5.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      242   84
##   Died          100  465
##                                        
##                Accuracy : 0.793        
##                  95% CI : (0.765, 0.82)
##     No Information Rate : 0.616        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.56         
##  Mcnemar's Test P-Value : 0.269        
##                                        
##             Sensitivity : 0.708        
##             Specificity : 0.847        
##          Pos Pred Value : 0.742        
##          Neg Pred Value : 0.823        
##              Prevalence : 0.384        
##          Detection Rate : 0.272        
##    Detection Prevalence : 0.366        
##       Balanced Accuracy : 0.777        
##                                        
##        'Positive' Class : Survived     
## 

Decision Tree ( Accuracy : 0.835)

fit5.dt <- rpart(Survived ~ Sex + Age + Pclass + SibSp + Parch, data=dt.train, method="class")
fancyRpartPlot(fit5.dt)

dt.train$pred.fit5.dt <- predict(fit5.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit5.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   56
##   Died           91  493
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.644         
##  Mcnemar's Test P-Value : 0.00504       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.898         
##          Pos Pred Value : 0.818         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.345         
##       Balanced Accuracy : 0.816         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest ( Accuracy : 0.85)

fit5.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp + Parch,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit5.rf <- predict(fit5.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit5.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      255   41
##   Died           87  508
##                                         
##                Accuracy : 0.856         
##                  95% CI : (0.832, 0.879)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.688         
##  Mcnemar's Test P-Value : 6.97e-05      
##                                         
##             Sensitivity : 0.746         
##             Specificity : 0.925         
##          Pos Pred Value : 0.861         
##          Neg Pred Value : 0.854         
##              Prevalence : 0.384         
##          Detection Rate : 0.286         
##    Detection Prevalence : 0.332         
##       Balanced Accuracy : 0.835         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit5.rf)

Model 6: Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked

Logistic (Accuracy : 0.804)

fit6.log <- glm(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked, family = binomial(link='logit'), data = dt.train)
summary(fit6.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + SibSp + Parch + 
##     Embarked, family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.478  -0.624   0.411   0.601   2.611  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.40437    0.43072  -10.23  < 2e-16 ***
## SexMale          2.71165    0.20116   13.48  < 2e-16 ***
## Age              0.04199    0.00786    5.34  9.1e-08 ***
## Pclass2nd Class  1.08184    0.27060    4.00  6.4e-05 ***
## Pclass3rd Class  2.35279    0.25590    9.19  < 2e-16 ***
## SibSp            0.33133    0.10835    3.06   0.0022 ** 
## Parch            0.07130    0.11666    0.61   0.5411    
## EmbarkedQ        0.17881    0.38786    0.46   0.6448    
## EmbarkedS        0.47233    0.23583    2.00   0.0452 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  781.74  on 882  degrees of freedom
## AIC: 799.7
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit6.log <- predict.glm(fit6.log, newdata = dt.train, type = "response")
dt.train$pred.fit6.log <- ifelse(dt.train$pred.fit6.log > 0.5,1,0)
dt.train$pred.fit6.log <- factor(dt.train$pred.fit6.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit6.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      242   75
##   Died          100  474
##                                         
##                Accuracy : 0.804         
##                  95% CI : (0.776, 0.829)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.579         
##  Mcnemar's Test P-Value : 0.0696        
##                                         
##             Sensitivity : 0.708         
##             Specificity : 0.863         
##          Pos Pred Value : 0.763         
##          Neg Pred Value : 0.826         
##              Prevalence : 0.384         
##          Detection Rate : 0.272         
##    Detection Prevalence : 0.356         
##       Balanced Accuracy : 0.785         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree ( Accuracy : 0.835)

fit6.dt <- rpart(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked, data=dt.train, method="class")
fancyRpartPlot(fit6.dt)

dt.train$pred.fit6.dt <- predict(fit6.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit6.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      222   27
##   Died          120  522
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.632         
##  Mcnemar's Test P-Value : 3.25e-14      
##                                         
##             Sensitivity : 0.649         
##             Specificity : 0.951         
##          Pos Pred Value : 0.892         
##          Neg Pred Value : 0.813         
##              Prevalence : 0.384         
##          Detection Rate : 0.249         
##    Detection Prevalence : 0.279         
##       Balanced Accuracy : 0.800         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest ( Accuracy : 0.869)

fit6.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit6.rf <- predict(fit6.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit6.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   25
##   Died           91  524
##                                         
##                Accuracy : 0.87          
##                  95% CI : (0.846, 0.891)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.714         
##  Mcnemar's Test P-Value : 1.59e-09      
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.954         
##          Pos Pred Value : 0.909         
##          Neg Pred Value : 0.852         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.310         
##       Balanced Accuracy : 0.844         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit6.rf)
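The accuracies above come from re-predicting on the same rows the forest was trained on, so they are optimistic. As a cross-check, the forest's out-of-bag (OOB) estimate can be read straight off the fitted object. A minimal sketch, assuming `fit6.rf` from the chunk above is still in scope:

```r
# Sketch: OOB accuracy from the fitted forest (assumes fit6.rf exists).
# For a 2-class fit, fit6.rf$confusion is a 2 x 3 matrix of
# observed-vs-OOB-predicted counts plus a class.error column,
# so drop the last column before summing.
oob.counts <- fit6.rf$confusion[, 1:2]
oob.acc <- sum(diag(oob.counts)) / sum(oob.counts)
oob.acc  # usually somewhat lower than the training accuracy above
```

The gap between this number and the training accuracy gives a rough sense of how much the forest is memorising the training set.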

Model 7: Survived ~ Sex + Age + Pclass + FamilySize

Logistic Regression (Accuracy: 0.796)

fit7.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize, family = binomial(link='logit'), data = dt.train)
summary(fit7.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.437  -0.619   0.423   0.610   2.618  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.38689    0.43452  -10.10  < 2e-16 ***
## SexMale          2.76132    0.19773   13.97  < 2e-16 ***
## Age              0.04216    0.00782    5.39  7.0e-08 ***
## Pclass2nd Class  1.21007    0.26141    4.63  3.7e-06 ***
## Pclass3rd Class  2.41812    0.24652    9.81  < 2e-16 ***
## FamilySize       0.22745    0.06440    3.53  0.00041 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  788.52  on 885  degrees of freedom
## AIC: 800.5
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit7.log <- predict.glm(fit7.log, newdata = dt.train, type = "response")
dt.train$pred.fit7.log <- ifelse(dt.train$pred.fit7.log > 0.5,1,0)
dt.train$pred.fit7.log <- factor(dt.train$pred.fit7.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit7.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      243   83
##   Died           99  466
##                                         
##                Accuracy : 0.796         
##                  95% CI : (0.768, 0.822)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.564         
##  Mcnemar's Test P-Value : 0.266         
##                                         
##             Sensitivity : 0.711         
##             Specificity : 0.849         
##          Pos Pred Value : 0.745         
##          Neg Pred Value : 0.825         
##              Prevalence : 0.384         
##          Detection Rate : 0.273         
##    Detection Prevalence : 0.366         
##       Balanced Accuracy : 0.780         
##                                         
##        'Positive' Class : Survived      
## 
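The predict → threshold → relabel sequence repeated after every logistic fit could be collapsed into a small helper. This is only a sketch, not part of the original analysis: `classify_glm` is a name introduced here, and it assumes the same 0/1 to Survived/Died labelling used throughout.

```r
# Hypothetical helper: probabilities from a fitted glm, cut at a
# threshold, relabelled to match the factor coding used in this analysis.
classify_glm <- function(fit, data, cutoff = 0.5) {
  p <- predict.glm(fit, newdata = data, type = "response")
  factor(ifelse(p > cutoff, 1, 0),
         levels = c(0, 1), labels = c("Survived", "Died"))
}
# Usage: dt.train$pred.fit7.log <- classify_glm(fit7.log, dt.train)
```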

Decision Tree (Accuracy: 0.84)

fit7.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize, data=dt.train, method="class")
fancyRpartPlot(fit7.dt)

dt.train$pred.fit7.dt <- predict(fit7.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit7.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   52
##   Died           91  497
##                                         
##                Accuracy : 0.84          
##                  95% CI : (0.814, 0.863)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.653         
##  Mcnemar's Test P-Value : 0.00148       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.905         
##          Pos Pred Value : 0.828         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.340         
##       Balanced Accuracy : 0.820         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.847)

fit7.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit7.rf <- predict(fit7.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit7.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      254   48
##   Died           88  501
##                                        
##                Accuracy : 0.847        
##                  95% CI : (0.822, 0.87)
##     No Information Rate : 0.616        
##     P-Value [Acc > NIR] : < 2e-16      
##                                        
##                   Kappa : 0.67         
##  Mcnemar's Test P-Value : 0.000825     
##                                        
##             Sensitivity : 0.743        
##             Specificity : 0.913        
##          Pos Pred Value : 0.841        
##          Neg Pred Value : 0.851        
##              Prevalence : 0.384        
##          Detection Rate : 0.285        
##    Detection Prevalence : 0.339        
##       Balanced Accuracy : 0.828        
##                                        
##        'Positive' Class : Survived     
## 
# Look at variable importance
varImpPlot(fit7.rf)

Model 8: Survived ~ Sex + Age + Pclass + FamilySize + Embarked

Logistic Regression (Accuracy: 0.806)

fit8.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize + Embarked , family = binomial(link='logit'), data = dt.train)
summary(fit8.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize + Embarked, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.476  -0.616   0.416   0.634   2.539  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.61714    0.45979  -10.04  < 2e-16 ***
## SexMale          2.73222    0.20015   13.65  < 2e-16 ***
## Age              0.04115    0.00784    5.25  1.5e-07 ***
## Pclass2nd Class  1.06297    0.26932    3.95  7.9e-05 ***
## Pclass3rd Class  2.34208    0.25510    9.18  < 2e-16 ***
## FamilySize       0.21439    0.06541    3.28    0.001 ** 
## EmbarkedQ        0.21421    0.38624    0.55    0.579    
## EmbarkedS        0.49533    0.23512    2.11    0.035 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  783.82  on 883  degrees of freedom
## AIC: 799.8
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit8.log <- predict.glm(fit8.log, newdata = dt.train, type = "response")
dt.train$pred.fit8.log <- ifelse(dt.train$pred.fit8.log > 0.5,1,0)
dt.train$pred.fit8.log <- factor(dt.train$pred.fit8.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit8.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      241   72
##   Died          101  477
##                                         
##                Accuracy : 0.806         
##                  95% CI : (0.778, 0.831)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.583         
##  Mcnemar's Test P-Value : 0.0333        
##                                         
##             Sensitivity : 0.705         
##             Specificity : 0.869         
##          Pos Pred Value : 0.770         
##          Neg Pred Value : 0.825         
##              Prevalence : 0.384         
##          Detection Rate : 0.270         
##    Detection Prevalence : 0.351         
##       Balanced Accuracy : 0.787         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.84)

fit8.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize + Embarked , data=dt.train, method="class")
fancyRpartPlot(fit8.dt)

dt.train$pred.fit8.dt <- predict(fit8.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit8.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   52
##   Died           91  497
##                                         
##                Accuracy : 0.84          
##                  95% CI : (0.814, 0.863)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.653         
##  Mcnemar's Test P-Value : 0.00148       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.905         
##          Pos Pred Value : 0.828         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.340         
##       Balanced Accuracy : 0.820         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.865)

fit8.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize + Embarked ,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit8.rf <- predict(fit8.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit8.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      242   20
##   Died          100  529
##                                         
##                Accuracy : 0.865         
##                  95% CI : (0.841, 0.887)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.702         
##  Mcnemar's Test P-Value : 5.53e-13      
##                                         
##             Sensitivity : 0.708         
##             Specificity : 0.964         
##          Pos Pred Value : 0.924         
##          Neg Pred Value : 0.841         
##              Prevalence : 0.384         
##          Detection Rate : 0.272         
##    Detection Prevalence : 0.294         
##       Balanced Accuracy : 0.836         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit8.rf)

Model 9: Survived ~ Sex + Age + Pclass + Fsize

Logistic Regression (Accuracy: 0.826)

fit9.log <- glm(Survived ~ Sex + Age + Pclass + Fsize, family = binomial(link='logit'), data = dt.train)
summary(fit9.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.631  -0.585   0.424   0.608   2.932  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.88104    0.42283   -9.18  < 2e-16 ***
## SexMale          2.75420    0.20286   13.58  < 2e-16 ***
## Age              0.03953    0.00812    4.87  1.1e-06 ***
## Pclass2nd Class  1.29046    0.26906    4.80  1.6e-06 ***
## Pclass3rd Class  2.30435    0.25311    9.10  < 2e-16 ***
## Fsize2          -0.03094    0.24242   -0.13    0.898    
## Fsize3          -0.53537    0.28320   -1.89    0.059 .  
## Fsize4          -0.48232    0.53810   -0.90    0.370    
## Fsize5+          2.13354    0.44171    4.83  1.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  763.82  on 882  degrees of freedom
## AIC: 781.8
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit9.log <- predict.glm(fit9.log, newdata = dt.train, type = "response")
dt.train$pred.fit9.log <- ifelse(dt.train$pred.fit9.log > 0.5,1,0)
dt.train$pred.fit9.log <- factor(dt.train$pred.fit9.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit9.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   64
##   Died           91  485
##                                      
##                Accuracy : 0.826      
##                  95% CI : (0.8, 0.85)
##     No Information Rate : 0.616      
##     P-Value [Acc > NIR] : <2e-16     
##                                      
##                   Kappa : 0.627      
##  Mcnemar's Test P-Value : 0.0368     
##                                      
##             Sensitivity : 0.734      
##             Specificity : 0.883      
##          Pos Pred Value : 0.797      
##          Neg Pred Value : 0.842      
##              Prevalence : 0.384      
##          Detection Rate : 0.282      
##    Detection Prevalence : 0.354      
##       Balanced Accuracy : 0.809      
##                                      
##        'Positive' Class : Survived   
## 

Decision Tree (Accuracy: 0.841)

fit9.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize, data=dt.train, method="class")
fancyRpartPlot(fit9.dt)

dt.train$pred.fit9.dt <- predict(fit9.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit9.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   51
##   Died           91  498
##                                         
##                Accuracy : 0.841         
##                  95% CI : (0.815, 0.864)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.655         
##  Mcnemar's Test P-Value : 0.00106       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.907         
##          Pos Pred Value : 0.831         
##          Neg Pred Value : 0.846         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.339         
##       Balanced Accuracy : 0.821         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.851)

fit9.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit9.rf <- predict(fit9.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit9.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      254   45
##   Died           88  504
##                                         
##                Accuracy : 0.851         
##                  95% CI : (0.826, 0.873)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.677         
##  Mcnemar's Test P-Value : 0.000271      
##                                         
##             Sensitivity : 0.743         
##             Specificity : 0.918         
##          Pos Pred Value : 0.849         
##          Neg Pred Value : 0.851         
##              Prevalence : 0.384         
##          Detection Rate : 0.285         
##    Detection Prevalence : 0.336         
##       Balanced Accuracy : 0.830         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit9.rf)

Model 10: Survived ~ Sex + Age + Pclass + Fsize + Embarked

Logistic Regression (Accuracy: 0.819)

fit10.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + Embarked , family = binomial(link='logit'), data = dt.train)
summary(fit10.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + Embarked, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.637  -0.571   0.407   0.615   2.873  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.05512    0.45159   -8.98  < 2e-16 ***
## SexMale          2.72740    0.20585   13.25  < 2e-16 ***
## Age              0.03871    0.00812    4.77  1.9e-06 ***
## Pclass2nd Class  1.17911    0.27755    4.25  2.2e-05 ***
## Pclass3rd Class  2.25785    0.26128    8.64  < 2e-16 ***
## Fsize2          -0.00467    0.24529   -0.02    0.985    
## Fsize3          -0.51743    0.28421   -1.82    0.069 .  
## Fsize4          -0.48744    0.54125   -0.90    0.368    
## Fsize5+          2.04438    0.44870    4.56  5.2e-06 ***
## EmbarkedQ        0.08591    0.39529    0.22    0.828    
## EmbarkedS        0.35531    0.24094    1.47    0.140    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  761.31  on 880  degrees of freedom
## AIC: 783.3
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit10.log <- predict.glm(fit10.log, newdata = dt.train, type = "response")
dt.train$pred.fit10.log <- ifelse(dt.train$pred.fit10.log > 0.5,1,0)
dt.train$pred.fit10.log <- factor(dt.train$pred.fit10.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit10.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      249   68
##   Died           93  481
##                                         
##                Accuracy : 0.819         
##                  95% CI : (0.792, 0.844)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.613         
##  Mcnemar's Test P-Value : 0.0586        
##                                         
##             Sensitivity : 0.728         
##             Specificity : 0.876         
##          Pos Pred Value : 0.785         
##          Neg Pred Value : 0.838         
##              Prevalence : 0.384         
##          Detection Rate : 0.279         
##    Detection Prevalence : 0.356         
##       Balanced Accuracy : 0.802         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.841)

fit10.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + Embarked , data=dt.train, method="class")
fancyRpartPlot(fit10.dt)

dt.train$pred.fit10.dt <- predict(fit10.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit10.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   51
##   Died           91  498
##                                         
##                Accuracy : 0.841         
##                  95% CI : (0.815, 0.864)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.655         
##  Mcnemar's Test P-Value : 0.00106       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.907         
##          Pos Pred Value : 0.831         
##          Neg Pred Value : 0.846         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.339         
##       Balanced Accuracy : 0.821         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.869)

fit10.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + Embarked ,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit10.rf <- predict(fit10.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit10.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      245   20
##   Died           97  529
##                                        
##                Accuracy : 0.869        
##                  95% CI : (0.845, 0.89)
##     No Information Rate : 0.616        
##     P-Value [Acc > NIR] : < 2e-16      
##                                        
##                   Kappa : 0.71         
##  Mcnemar's Test P-Value : 2.12e-12     
##                                        
##             Sensitivity : 0.716        
##             Specificity : 0.964        
##          Pos Pred Value : 0.925        
##          Neg Pred Value : 0.845        
##              Prevalence : 0.384        
##          Detection Rate : 0.275        
##    Detection Prevalence : 0.297        
##       Balanced Accuracy : 0.840        
##                                        
##        'Positive' Class : Survived     
## 
# Look at variable importance
varImpPlot(fit10.rf)
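With three fits per formula, the training accuracies are easier to compare side by side than buried in the confusion-matrix printouts. A sketch for Model 10, assuming the `pred.fit10.*` columns created above exist in `dt.train`:

```r
# Sketch: training accuracy of the three Model 10 fits in one table
# (assumes the prediction columns created above are present).
acc <- function(pred) round(mean(pred == dt.train$Survived), 3)
data.frame(model    = c("logistic", "decision tree", "random forest"),
           accuracy = c(acc(dt.train$pred.fit10.log),
                        acc(dt.train$pred.fit10.dt),
                        acc(dt.train$pred.fit10.rf)))
```

The same pattern extends to any of the other model numbers by swapping the column names.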

Model 11: Survived ~ Sex + Age + Pclass + FamilySize_dataSet

Logistic Regression (Accuracy: 0.792)

fit11.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize_dataSet, family = binomial(link='logit'), data = dt.train)
summary(fit11.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize_dataSet, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.451  -0.586   0.424   0.589   2.618  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -4.68388    0.45075  -10.39  < 2e-16 ***
## SexMale             2.77868    0.19746   14.07  < 2e-16 ***
## Age                 0.04528    0.00803    5.64  1.7e-08 ***
## Pclass2nd Class     1.26631    0.26527    4.77  1.8e-06 ***
## Pclass3rd Class     2.42174    0.24821    9.76  < 2e-16 ***
## FamilySize_dataSet  0.39954    0.08992    4.44  8.9e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  779.01  on 885  degrees of freedom
## AIC: 791
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit11.log <- predict.glm(fit11.log, newdata = dt.train, type = "response")
dt.train$pred.fit11.log <- ifelse(dt.train$pred.fit11.log > 0.5,1,0)
dt.train$pred.fit11.log <- factor(dt.train$pred.fit11.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit11.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      239   82
##   Died          103  467
##                                         
##                Accuracy : 0.792         
##                  95% CI : (0.764, 0.819)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.556         
##  Mcnemar's Test P-Value : 0.141         
##                                         
##             Sensitivity : 0.699         
##             Specificity : 0.851         
##          Pos Pred Value : 0.745         
##          Neg Pred Value : 0.819         
##              Prevalence : 0.384         
##          Detection Rate : 0.268         
##    Detection Prevalence : 0.360         
##       Balanced Accuracy : 0.775         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.832)

fit11.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize_dataSet, data=dt.train, method="class")
fancyRpartPlot(fit11.dt)

dt.train$pred.fit11.dt <- predict(fit11.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit11.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      253   61
##   Died           89  488
##                                         
##                Accuracy : 0.832         
##                  95% CI : (0.805, 0.856)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.639         
##  Mcnemar's Test P-Value : 0.0275        
##                                         
##             Sensitivity : 0.740         
##             Specificity : 0.889         
##          Pos Pred Value : 0.806         
##          Neg Pred Value : 0.846         
##              Prevalence : 0.384         
##          Detection Rate : 0.284         
##    Detection Prevalence : 0.352         
##       Balanced Accuracy : 0.814         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.848)

fit11.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize_dataSet,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit11.rf <- predict(fit11.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit11.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      255   48
##   Died           87  501
##                                         
##                Accuracy : 0.848         
##                  95% CI : (0.823, 0.871)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.673         
##  Mcnemar's Test P-Value : 0.00107       
##                                         
##             Sensitivity : 0.746         
##             Specificity : 0.913         
##          Pos Pred Value : 0.842         
##          Neg Pred Value : 0.852         
##              Prevalence : 0.384         
##          Detection Rate : 0.286         
##    Detection Prevalence : 0.340         
##       Balanced Accuracy : 0.829         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit11.rf)

Model 12: Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked

Logistic Regression (Accuracy: 0.81)

fit12.log <- glm(Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked , family = binomial(link='logit'), data = dt.train)
summary(fit12.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + FamilySize_dataSet + 
##     Embarked, family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.486  -0.590   0.421   0.596   2.542  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -4.89675    0.47318  -10.35  < 2e-16 ***
## SexMale             2.75329    0.19979   13.78  < 2e-16 ***
## Age                 0.04422    0.00804    5.50  3.8e-08 ***
## Pclass2nd Class     1.12520    0.27324    4.12  3.8e-05 ***
## Pclass3rd Class     2.34726    0.25646    9.15  < 2e-16 ***
## FamilySize_dataSet  0.38359    0.09130    4.20  2.7e-05 ***
## EmbarkedQ           0.22032    0.38836    0.57    0.570    
## EmbarkedS           0.46687    0.23528    1.98    0.047 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  774.9  on 883  degrees of freedom
## AIC: 790.9
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit12.log <- predict.glm(fit12.log, newdata = dt.train, type = "response")
dt.train$pred.fit12.log <- ifelse(dt.train$pred.fit12.log > 0.5,1,0)
dt.train$pred.fit12.log <- factor(dt.train$pred.fit12.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit12.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      243   70
##   Died           99  479
##                                         
##                Accuracy : 0.81          
##                  95% CI : (0.783, 0.836)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.592         
##  Mcnemar's Test P-Value : 0.0313        
##                                         
##             Sensitivity : 0.711         
##             Specificity : 0.872         
##          Pos Pred Value : 0.776         
##          Neg Pred Value : 0.829         
##              Prevalence : 0.384         
##          Detection Rate : 0.273         
##    Detection Prevalence : 0.351         
##       Balanced Accuracy : 0.792         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.835)

fit12.dt <- rpart(Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked , data=dt.train, method="class")
fancyRpartPlot(fit12.dt)

dt.train$pred.fit12.dt <- predict(fit12.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit12.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      221   26
##   Died          121  523
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.632         
##  Mcnemar's Test P-Value : 8.98e-15      
##                                         
##             Sensitivity : 0.646         
##             Specificity : 0.953         
##          Pos Pred Value : 0.895         
##          Neg Pred Value : 0.812         
##              Prevalence : 0.384         
##          Detection Rate : 0.248         
##    Detection Prevalence : 0.277         
##       Balanced Accuracy : 0.799         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.863)

fit12.rf <- randomForest(Survived ~ Sex + Age + Pclass + FamilySize_dataSet + Embarked ,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit12.rf <- predict(fit12.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit12.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      244   24
##   Died           98  525
##                                         
##                Accuracy : 0.863         
##                  95% CI : (0.839, 0.885)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.698         
##  Mcnemar's Test P-Value : 3.87e-11      
##                                         
##             Sensitivity : 0.713         
##             Specificity : 0.956         
##          Pos Pred Value : 0.910         
##          Neg Pred Value : 0.843         
##              Prevalence : 0.384         
##          Detection Rate : 0.274         
##    Detection Prevalence : 0.301         
##       Balanced Accuracy : 0.835         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit12.rf)
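Because the random forest is evaluated on its own training rows here, the 0.863 is optimistic. randomForest keeps out-of-bag (OOB) predictions that give a fairer estimate; calling `predict()` without `newdata` returns them:

```r
# predict() without newdata returns the out-of-bag class predictions,
# so this accuracy is not inflated by scoring the training rows.
oob.pred <- predict(fit12.rf)
mean(oob.pred == dt.train$Survived)  # OOB accuracy, typically below 0.863
fit12.rf$confusion                   # OOB confusion matrix with class errors
```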

Model 13: Survived ~ Sex + Age + Pclass + Fsize + NewTitle

Logistic (Accuracy: 0.833)

fit13.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + NewTitle, family = binomial(link='logit'), data = dt.train)
summary(fit13.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + NewTitle, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.415  -0.522   0.400   0.535   2.697  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -14.84669  535.41138   -0.03   0.9779    
## SexMale          14.32306  535.41150    0.03   0.9787    
## Age               0.02614    0.00968    2.70   0.0069 ** 
## Pclass2nd Class   1.44925    0.29247    4.96  7.2e-07 ***
## Pclass3rd Class   2.38485    0.26895    8.87  < 2e-16 ***
## Fsize2            0.30835    0.26943    1.14   0.2524    
## Fsize3            0.17462    0.32732    0.53   0.5937    
## Fsize4            0.06919    0.58242    0.12   0.9054    
## Fsize5+           2.89729    0.46305    6.26  3.9e-10 ***
## NewTitleMrs      10.51332  535.41131    0.02   0.9843    
## NewTitleMaster   -3.57631    0.85033   -4.21  2.6e-05 ***
## NewTitleMiss     11.19556  535.41128    0.02   0.9833    
## NewTitleMr       -0.17772    0.61624   -0.29   0.7731    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  717.63  on 878  degrees of freedom
## AIC: 743.6
## 
## Number of Fisher Scoring iterations: 12
dt.train$pred.fit13.log <- predict.glm(fit13.log, newdata = dt.train, type = "response")
dt.train$pred.fit13.log <- ifelse(dt.train$pred.fit13.log > 0.5,1,0)
dt.train$pred.fit13.log <- factor(dt.train$pred.fit13.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit13.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   59
##   Died           90  490
##                                         
##                Accuracy : 0.833         
##                  95% CI : (0.807, 0.857)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.64          
##  Mcnemar's Test P-Value : 0.014         
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.893         
##          Pos Pred Value : 0.810         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.349         
##       Balanced Accuracy : 0.815         
##                                         
##        'Positive' Class : Survived      
## 
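The standard errors near 535 for SexMale, NewTitleMrs and NewTitleMiss (together with the 12 Fisher scoring iterations) are a symptom of quasi-complete separation: Sex is almost fully determined by the title, so the two predictors are redundant. A cross-tabulation makes this visible (a sketch, assuming the Sex and NewTitle columns as used above):

```r
# Cells of zero (e.g. no male "Mrs"/"Miss", no female "Mr"/"Master")
# mean Sex adds no information once NewTitle is in the model, which is
# what inflates those coefficient standard errors.
table(dt.train$Sex, dt.train$NewTitle)
```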

Decision Tree (Accuracy: 0.834)

fit13.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + NewTitle, data=dt.train, method="class")
fancyRpartPlot(fit13.dt)

dt.train$pred.fit13.dt <- predict(fit13.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit13.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   57
##   Died           91  492
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.642         
##  Mcnemar's Test P-Value : 0.00668       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.896         
##          Pos Pred Value : 0.815         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.346         
##       Balanced Accuracy : 0.815         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.847)

fit13.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + NewTitle,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit13.rf <- predict(fit13.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit13.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      253   47
##   Died           89  502
##                                        
##                Accuracy : 0.847        
##                  95% CI : (0.822, 0.87)
##     No Information Rate : 0.616        
##     P-Value [Acc > NIR] : < 2e-16      
##                                        
##                   Kappa : 0.67         
##  Mcnemar's Test P-Value : 0.000439     
##                                        
##             Sensitivity : 0.740        
##             Specificity : 0.914        
##          Pos Pred Value : 0.843        
##          Neg Pred Value : 0.849        
##              Prevalence : 0.384        
##          Detection Rate : 0.284        
##    Detection Prevalence : 0.337        
##       Balanced Accuracy : 0.827        
##                                        
##        'Positive' Class : Survived     
## 
# Look at variable importance
varImpPlot(fit13.rf)

Model 14: Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked

Logistic (Accuracy: 0.831)

fit14.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked, family = binomial(link='logit'), data = dt.train)
summary(fit14.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + NewTitle + 
##     Embarked, family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.449  -0.530   0.385   0.525   2.636  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -15.2108   535.4114   -0.03   0.9773    
## SexMale          14.5116   535.4115    0.03   0.9784    
## Age               0.0254     0.0097    2.62   0.0087 ** 
## Pclass2nd Class   1.3275     0.3010    4.41  1.0e-05 ***
## Pclass3rd Class   2.3445     0.2791    8.40  < 2e-16 ***
## Fsize2            0.3494     0.2729    1.28   0.2004    
## Fsize3            0.1973     0.3279    0.60   0.5473    
## Fsize4            0.0984     0.5861    0.17   0.8667    
## Fsize5+           2.8253     0.4690    6.02  1.7e-09 ***
## NewTitleMrs      10.6355   535.4113    0.02   0.9842    
## NewTitleMaster   -3.6220     0.8553   -4.23  2.3e-05 ***
## NewTitleMiss     11.3698   535.4113    0.02   0.9831    
## NewTitleMr       -0.2406     0.6230   -0.39   0.6994    
## EmbarkedQ         0.0618     0.3994    0.15   0.8771    
## EmbarkedS         0.3981     0.2508    1.59   0.1125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  714.55  on 876  degrees of freedom
## AIC: 744.5
## 
## Number of Fisher Scoring iterations: 12
dt.train$pred.fit14.log <- predict.glm(fit14.log, newdata = dt.train, type = "response")
dt.train$pred.fit14.log <- ifelse(dt.train$pred.fit14.log > 0.5,1,0)
dt.train$pred.fit14.log <- factor(dt.train$pred.fit14.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit14.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      257   66
##   Died           85  483
##                                         
##                Accuracy : 0.831         
##                  95% CI : (0.804, 0.855)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.638         
##  Mcnemar's Test P-Value : 0.143         
##                                         
##             Sensitivity : 0.751         
##             Specificity : 0.880         
##          Pos Pred Value : 0.796         
##          Neg Pred Value : 0.850         
##              Prevalence : 0.384         
##          Detection Rate : 0.288         
##    Detection Prevalence : 0.363         
##       Balanced Accuracy : 0.816         
##                                         
##        'Positive' Class : Survived      
## 
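The predict/threshold/relabel boilerplate repeats for every logistic model. A small helper (hypothetical, not part of the original code) keeps the cutoff and factor levels in one place:

```r
# Hypothetical convenience wrapper: probability -> class label,
# using the same 0.5 cutoff and level order as the code above.
classify.log <- function(fit, data, cutoff = 0.5) {
  p <- predict(fit, newdata = data, type = "response")
  factor(ifelse(p > cutoff, 1, 0),
         levels = c(0, 1), labels = c("Survived", "Died"))
}
dt.train$pred.fit14.log <- classify.log(fit14.log, dt.train)
```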

Decision Tree (Accuracy: 0.834)

fit14.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked, data=dt.train, method="class")
fancyRpartPlot(fit14.dt)

dt.train$pred.fit14.dt <- predict(fit14.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit14.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   57
##   Died           91  492
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.642         
##  Mcnemar's Test P-Value : 0.00668       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.896         
##          Pos Pred Value : 0.815         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.346         
##       Balanced Accuracy : 0.815         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.861)

fit14.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + NewTitle + Embarked,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit14.rf <- predict(fit14.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit14.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      244   26
##   Died           98  523
##                                         
##                Accuracy : 0.861         
##                  95% CI : (0.836, 0.883)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.694         
##  Mcnemar's Test P-Value : 1.82e-10      
##                                         
##             Sensitivity : 0.713         
##             Specificity : 0.953         
##          Pos Pred Value : 0.904         
##          Neg Pred Value : 0.842         
##              Prevalence : 0.384         
##          Detection Rate : 0.274         
##    Detection Prevalence : 0.303         
##       Balanced Accuracy : 0.833         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit14.rf)

Model 15: Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st

Logistic (Accuracy: 0.834)

fit15.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st , family = binomial(link='logit'), data = dt.train)
summary(fit15.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.397  -0.550   0.396   0.557   2.658  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -3.44353    0.60498   -5.69  1.3e-08 ***
## SexMale               -0.43497    0.65497   -0.66    0.507    
## Age                    0.02044    0.00932    2.19    0.028 *  
## Pclass2nd Class        1.34719    0.28247    4.77  1.8e-06 ***
## Pclass3rd Class        2.31258    0.26417    8.75  < 2e-16 ***
## Fsize2                 0.15872    0.25407    0.62    0.532    
## Fsize3                 0.02978    0.31829    0.09    0.925    
## Fsize4                -0.10135    0.57676   -0.18    0.861    
## Fsize5+                2.80083    0.47454    5.90  3.6e-09 ***
## WomanChild12_1stWomen -0.17940    0.55928   -0.32    0.748    
## WomanChild12_1stMen    3.46207    0.59717    5.80  6.7e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.7  on 890  degrees of freedom
## Residual deviance:  721.4  on 880  degrees of freedom
## AIC: 743.4
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit15.log <- predict.glm(fit15.log, newdata = dt.train, type = "response")
dt.train$pred.fit15.log <- ifelse(dt.train$pred.fit15.log > 0.5,1,0)
dt.train$pred.fit15.log <- factor(dt.train$pred.fit15.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit15.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   57
##   Died           91  492
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.642         
##  Mcnemar's Test P-Value : 0.00668       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.896         
##          Pos Pred Value : 0.815         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.346         
##       Balanced Accuracy : 0.815         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.835)

fit15.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st , data=dt.train, method="class")
fancyRpartPlot(fit15.dt)

dt.train$pred.fit15.dt <- predict(fit15.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit15.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   57
##   Died           90  492
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.645         
##  Mcnemar's Test P-Value : 0.00831       
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.896         
##          Pos Pred Value : 0.816         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.347         
##       Balanced Accuracy : 0.817         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.844)

fit15.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st ,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit15.rf <- predict(fit15.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit15.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   49
##   Died           90  500
##                                         
##                Accuracy : 0.844         
##                  95% CI : (0.818, 0.867)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.663         
##  Mcnemar's Test P-Value : 0.000692      
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.911         
##          Pos Pred Value : 0.837         
##          Neg Pred Value : 0.847         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.338         
##       Balanced Accuracy : 0.824         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit15.rf)

Model 16: Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked

Logistic (Accuracy: 0.827)

fit16.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked, family = binomial(link='logit'), data = dt.train)
summary(fit16.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + 
##     Embarked, family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.421  -0.553   0.381   0.544   2.580  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -3.65849    0.62451   -5.86  4.7e-09 ***
## SexMale               -0.43462    0.65824   -0.66     0.51    
## Age                    0.01923    0.00936    2.05     0.04 *  
## Pclass2nd Class        1.22344    0.29170    4.19  2.7e-05 ***
## Pclass3rd Class        2.24714    0.27248    8.25  < 2e-16 ***
## Fsize2                 0.19485    0.25774    0.76     0.45    
## Fsize3                 0.05615    0.31922    0.18     0.86    
## Fsize4                -0.05728    0.57910   -0.10     0.92    
## Fsize5+                2.74206    0.47945    5.72  1.1e-08 ***
## WomanChild12_1stWomen -0.15633    0.56590   -0.28     0.78    
## WomanChild12_1stMen    3.47088    0.59879    5.80  6.8e-09 ***
## EmbarkedQ              0.19032    0.40385    0.47     0.64    
## EmbarkedS              0.38593    0.24888    1.55     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  718.87  on 878  degrees of freedom
## AIC: 744.9
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit16.log <- predict.glm(fit16.log, newdata = dt.train, type = "response")
dt.train$pred.fit16.log <- ifelse(dt.train$pred.fit16.log > 0.5,1,0)
dt.train$pred.fit16.log <- factor(dt.train$pred.fit16.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit16.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      254   66
##   Died           88  483
##                                         
##                Accuracy : 0.827         
##                  95% CI : (0.801, 0.851)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.63          
##  Mcnemar's Test P-Value : 0.0906        
##                                         
##             Sensitivity : 0.743         
##             Specificity : 0.880         
##          Pos Pred Value : 0.794         
##          Neg Pred Value : 0.846         
##              Prevalence : 0.384         
##          Detection Rate : 0.285         
##    Detection Prevalence : 0.359         
##       Balanced Accuracy : 0.811         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.835)

fit16.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked, data=dt.train, method="class")
fancyRpartPlot(fit16.dt)

dt.train$pred.fit16.dt <- predict(fit16.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit16.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   57
##   Died           90  492
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.645         
##  Mcnemar's Test P-Value : 0.00831       
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.896         
##          Pos Pred Value : 0.816         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.347         
##       Balanced Accuracy : 0.817         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.861)

fit16.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + WomanChild12_1st + Embarked,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit16.rf <- predict(fit16.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit16.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      246   28
##   Died           96  521
##                                         
##                Accuracy : 0.861         
##                  95% CI : (0.836, 0.883)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.694         
##  Mcnemar's Test P-Value : 1.78e-09      
##                                         
##             Sensitivity : 0.719         
##             Specificity : 0.949         
##          Pos Pred Value : 0.898         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.276         
##    Detection Prevalence : 0.308         
##       Balanced Accuracy : 0.834         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit16.rf)
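With `ntree = 2000`, the fit is likely using far more trees than needed. The plot method for randomForest shows the OOB error against the number of trees, which usually flattens long before that:

```r
# The black line is the overall OOB error; the coloured lines are the
# per-class errors. If the curves are flat well before 2000 trees,
# a smaller ntree would give the same model faster.
plot(fit16.rf, main = "fit16.rf: OOB error vs. number of trees")
```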

Model 17: Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st

Logistic (Accuracy: 0.832)

fit17.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st , family = binomial(link='logit'), data = dt.train)
summary(fit17.log)
## 
## Call:
## glm(formula = Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st, 
##     family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.405  -0.549   0.394   0.570   2.667  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -3.43792    0.56661   -6.07  1.3e-09 ***
## SexMale               -0.19491    0.61333   -0.32    0.751    
## Age                    0.02125    0.00936    2.27    0.023 *  
## Pclass2nd Class        1.34290    0.28191    4.76  1.9e-06 ***
## Pclass3rd Class        2.31714    0.26384    8.78  < 2e-16 ***
## Fsize2                 0.13801    0.25307    0.55    0.586    
## Fsize3                -0.01202    0.31383   -0.04    0.969    
## Fsize4                -0.13071    0.57105   -0.23    0.819    
## Fsize5+                2.73261    0.46133    5.92  3.2e-09 ***
## WomanChild14_1stWomen -0.19292    0.51626   -0.37    0.709    
## WomanChild14_1stMen    3.19351    0.56586    5.64  1.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  724.82  on 880  degrees of freedom
## AIC: 746.8
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit17.log <- predict.glm(fit17.log, newdata = dt.train, type = "response")
dt.train$pred.fit17.log <- ifelse(dt.train$pred.fit17.log > 0.5,1,0)
dt.train$pred.fit17.log <- factor(dt.train$pred.fit17.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit17.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      250   58
##   Died           92  491
##                                         
##                Accuracy : 0.832         
##                  95% CI : (0.805, 0.856)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.637         
##  Mcnemar's Test P-Value : 0.00705       
##                                         
##             Sensitivity : 0.731         
##             Specificity : 0.894         
##          Pos Pred Value : 0.812         
##          Neg Pred Value : 0.842         
##              Prevalence : 0.384         
##          Detection Rate : 0.281         
##    Detection Prevalence : 0.346         
##       Balanced Accuracy : 0.813         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.834)

fit17.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st , data=dt.train, method="class")
fancyRpartPlot(fit17.dt)

dt.train$pred.fit17.dt <- predict(fit17.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit17.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   58
##   Died           90  491
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.643         
##  Mcnemar's Test P-Value : 0.0108        
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.894         
##          Pos Pred Value : 0.813         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.348         
##       Balanced Accuracy : 0.816         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.844)

fit17.rf <- randomForest(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit17.rf <- predict(fit17.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit17.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   49
##   Died           90  500
##                                         
##                Accuracy : 0.844         
##                  95% CI : (0.818, 0.867)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.663         
##  Mcnemar's Test P-Value : 0.000692      
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.911         
##          Pos Pred Value : 0.837         
##          Neg Pred Value : 0.847         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.338         
##       Balanced Accuracy : 0.824         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit17.rf)
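All accuracies so far are training-set accuracies. Since caret is already in use for `confusionMatrix`, a k-fold cross-validated estimate is only a few lines more; a sketch for the Model 17 decision tree (assuming caret is loaded):

```r
# 10-fold cross-validation of the Model 17 tree via caret; the resampled
# accuracy is a fairer basis for comparing models than training accuracy.
ctrl <- trainControl(method = "cv", number = 10)
cv.fit17 <- train(Survived ~ Sex + Age + Pclass + Fsize + WomanChild14_1st,
                  data = dt.train, method = "rpart", trControl = ctrl)
cv.fit17$results
```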

Model 18: Survived ~ Pclass + Fsize + WomanChild14_1st

Logistic (Accuracy: 0.832)

fit18.log <- glm(Survived ~ Pclass + Fsize + WomanChild14_1st , family = binomial(link='logit'), data = dt.train)
summary(fit18.log)
## 
## Call:
## glm(formula = Survived ~ Pclass + Fsize + WomanChild14_1st, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.320  -0.596   0.399   0.605   2.609  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -3.19613    0.46206   -6.92  4.6e-12 ***
## Pclass2nd Class        1.20572    0.27350    4.41  1.0e-05 ***
## Pclass3rd Class        2.09045    0.24112    8.67  < 2e-16 ***
## Fsize2                 0.13147    0.25272    0.52     0.60    
## Fsize3                 0.00371    0.31273    0.01     0.99    
## Fsize4                -0.17321    0.57068   -0.30     0.76    
## Fsize5+                2.69648    0.44581    6.05  1.5e-09 ***
## WomanChild14_1stWomen  0.35112    0.37603    0.93     0.35    
## WomanChild14_1stMen    3.59545    0.40604    8.85  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  730.26  on 882  degrees of freedom
## AIC: 748.3
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit18.log <- predict.glm(fit18.log, newdata = dt.train, type = "response")
dt.train$pred.fit18.log <- ifelse(dt.train$pred.fit18.log > 0.5,1,0)
dt.train$pred.fit18.log <- factor(dt.train$pred.fit18.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit18.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      250   58
##   Died           92  491
##                                         
##                Accuracy : 0.832         
##                  95% CI : (0.805, 0.856)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.637         
##  Mcnemar's Test P-Value : 0.00705       
##                                         
##             Sensitivity : 0.731         
##             Specificity : 0.894         
##          Pos Pred Value : 0.812         
##          Neg Pred Value : 0.842         
##              Prevalence : 0.384         
##          Detection Rate : 0.281         
##    Detection Prevalence : 0.346         
##       Balanced Accuracy : 0.813         
##                                         
##        'Positive' Class : Survived      
## 
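The logistic fits can also be ranked by AIC, already printed in each summary (743.6, 744.5, 743.4, 744.9, 746.8, 748.3). Collecting them in one place (assuming the fit objects are still in the workspace):

```r
# Lower AIC is better; this reproduces the values from the summaries above.
sapply(list(fit13 = fit13.log, fit14 = fit14.log, fit15 = fit15.log,
            fit16 = fit16.log, fit17 = fit17.log, fit18 = fit18.log), AIC)
```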

Decision Tree (Accuracy: 0.834)

fit18.dt <- rpart(Survived ~ Pclass + Fsize + WomanChild14_1st , data=dt.train, method="class")
fancyRpartPlot(fit18.dt)

dt.train$pred.fit18.dt <- predict(fit18.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit18.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   58
##   Died           90  491
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.643         
##  Mcnemar's Test P-Value : 0.0108        
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.894         
##          Pos Pred Value : 0.813         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.348         
##       Balanced Accuracy : 0.816         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.834)

fit18.rf <- randomForest(Survived ~ Pclass + Fsize + WomanChild14_1st,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit18.rf <- predict(fit18.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit18.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      250   56
##   Died           92  493
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.642         
##  Mcnemar's Test P-Value : 0.00401       
##                                         
##             Sensitivity : 0.731         
##             Specificity : 0.898         
##          Pos Pred Value : 0.817         
##          Neg Pred Value : 0.843         
##              Prevalence : 0.384         
##          Detection Rate : 0.281         
##    Detection Prevalence : 0.343         
##       Balanced Accuracy : 0.814         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit18.rf)

Model 19: Survived ~ Pclass + Fsize + WomanChild12_1st

Logistic (Accuracy: 0.833)

fit19.log <- glm(Survived ~ Pclass + Fsize + WomanChild12_1st , family = binomial(link='logit'), data = dt.train)
summary(fit19.log)
## 
## Call:
## glm(formula = Survived ~ Pclass + Fsize + WomanChild12_1st, family = binomial(link = "logit"), 
##     data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.325  -0.591   0.401   0.605   2.649  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -3.3641     0.4922   -6.84  8.2e-12 ***
## Pclass2nd Class         1.2149     0.2742    4.43  9.4e-06 ***
## Pclass3rd Class         2.0923     0.2413    8.67  < 2e-16 ***
## Fsize2                  0.1508     0.2536    0.59     0.55    
## Fsize3                  0.0564     0.3175    0.18     0.86    
## Fsize4                 -0.1136     0.5764   -0.20     0.84    
## Fsize5+                 2.7551     0.4579    6.02  1.8e-09 ***
## WomanChild12_1stWomen   0.4909     0.4039    1.22     0.22    
## WomanChild12_1stMen     3.7539     0.4372    8.59  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  726.79  on 882  degrees of freedom
## AIC: 744.8
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.fit19.log <- predict.glm(fit19.log, newdata = dt.train, type = "response")
dt.train$pred.fit19.log <- ifelse(dt.train$pred.fit19.log > 0.5,1,0)
dt.train$pred.fit19.log <- factor(dt.train$pred.fit19.log, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.fit19.log, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      250   57
##   Died           92  492
##                                         
##                Accuracy : 0.833         
##                  95% CI : (0.807, 0.857)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.64          
##  Mcnemar's Test P-Value : 0.00535       
##                                         
##             Sensitivity : 0.731         
##             Specificity : 0.896         
##          Pos Pred Value : 0.814         
##          Neg Pred Value : 0.842         
##              Prevalence : 0.384         
##          Detection Rate : 0.281         
##    Detection Prevalence : 0.345         
##       Balanced Accuracy : 0.814         
##                                         
##        'Positive' Class : Survived      
## 

Decision Tree (Accuracy: 0.835)

fit19.dt <- rpart(Survived ~ Pclass + Fsize + WomanChild12_1st , data=dt.train, method="class")
fancyRpartPlot(fit19.dt)

dt.train$pred.fit19.dt <- predict(fit19.dt, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit19.dt, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   57
##   Died           90  492
##                                         
##                Accuracy : 0.835         
##                  95% CI : (0.809, 0.859)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.645         
##  Mcnemar's Test P-Value : 0.00831       
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.896         
##          Pos Pred Value : 0.816         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.347         
##       Balanced Accuracy : 0.817         
##                                         
##        'Positive' Class : Survived      
## 

Random Forest (Accuracy: 0.834)

fit19.rf <- randomForest(Survived ~ Pclass + Fsize + WomanChild12_1st,
                    data=dt.train, importance=TRUE, ntree=2000)
dt.train$pred.fit19.rf <- predict(fit19.rf, newdata=dt.train,type='class')
confusionMatrix(dt.train$pred.fit19.rf, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      251   57
##   Died           91  492
##                                         
##                Accuracy : 0.834         
##                  95% CI : (0.808, 0.858)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : < 2e-16       
##                                         
##                   Kappa : 0.642         
##  Mcnemar's Test P-Value : 0.00668       
##                                         
##             Sensitivity : 0.734         
##             Specificity : 0.896         
##          Pos Pred Value : 0.815         
##          Neg Pred Value : 0.844         
##              Prevalence : 0.384         
##          Detection Rate : 0.282         
##    Detection Prevalence : 0.346         
##       Balanced Accuracy : 0.815         
##                                         
##        'Positive' Class : Survived      
## 
# Look at variable importance
varImpPlot(fit19.rf)

Model 20: Survived ~ complete

Logistic (Accuracy: 0.833)

fit20.log_C <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + 
                Title + NewTitle + WomanChild12_1st +WomanChild14_1st+
                Fsize + FamilySize_dataSet , family = binomial(link='logit'), data = dt.train)

fit20.log_N <- glm(Survived ~ 1 , family = binomial(link='logit'), data = dt.train)

#Best Model: Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
forwards = step(fit20.log_N,scope=list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)), direction="forward")
## Start:  AIC=1189
## Survived ~ 1
## 
##                      Df Deviance  AIC
## + WomanChild12_1st    2      882  888
## + WomanChild14_1st    2      887  893
## + NewTitle            4      883  893
## + Title              16      869  903
## + Sex                 1      918  922
## + Pclass              2     1083 1089
## + Fsize               4     1108 1118
## + Embarked            2     1161 1167
## + Parch               1     1181 1185
## + Age                 1     1181 1185
## <none>                      1187 1189
## + FamilySize_dataSet  1     1185 1189
## + SibSp               1     1186 1190
## 
## Step:  AIC=888
## Survived ~ WomanChild12_1st
## 
##                      Df Deviance AIC
## + Pclass              2      779 789
## + Fsize               4      810 824
## + FamilySize_dataSet  1      826 834
## + SibSp               1      845 853
## + Embarked            2      860 870
## + Parch               1      869 877
## + Age                 1      880 888
## <none>                       882 888
## + Sex                 1      882 890
## + WomanChild14_1st    2      881 891
## + NewTitle            4      879 893
## + Title              16      866 904
## 
## Step:  AIC=789
## Survived ~ WomanChild12_1st + Pclass
## 
##                      Df Deviance AIC
## + FamilySize_dataSet  1      732 744
## + Fsize               4      727 745
## + SibSp               1      748 760
## + Parch               1      767 779
## + Age                 1      773 785
## + Embarked            2      771 785
## <none>                       779 789
## + Sex                 1      779 791
## + WomanChild14_1st    2      779 793
## + NewTitle            4      778 796
## + Title              16      770 812
## 
## Step:  AIC=744
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet
## 
##                    Df Deviance AIC
## + Fsize             4      720 740
## + Age               1      727 741
## + Embarked          2      728 744
## <none>                     732 744
## + SibSp             1      731 745
## + Parch             1      731 745
## + Sex               1      732 746
## + WomanChild14_1st  2      732 748
## + NewTitle          4      728 748
## + Title            16      721 765
## 
## Step:  AIC=740
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize
## 
##                    Df Deviance AIC
## + Age               1      715 737
## <none>                     720 740
## + Embarked          2      717 741
## + Parch             1      719 741
## + Sex               1      719 741
## + SibSp             1      719 741
## + WomanChild14_1st  2      719 743
## + NewTitle          4      717 745
## + Title            16      710 762
## 
## Step:  AIC=737
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize + 
##     Age
## 
##                    Df Deviance AIC
## <none>                     715 737
## + Parch             1      714 738
## + Sex               1      714 738
## + SibSp             1      714 738
## + Embarked          2      712 738
## + NewTitle          4      710 740
## + WomanChild14_1st  2      714 740
## + Title            16      702 756
#Best Model: Age + Pclass +  Fsize + FamilySize_dataSet + WomanChild12_1st
backwards = step(fit20.log_C) # Backwards selection is the default
## Start:  AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Title + 
##     NewTitle + WomanChild12_1st + WomanChild14_1st + Fsize + 
##     FamilySize_dataSet
## 
## 
## Step:  AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Title + 
##     WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
## 
## 
## Step:  AIC=764
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Title + 
##     WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
## 
##                      Df Deviance AIC
## - Title              16      711 745
## - WomanChild14_1st    2      699 761
## - WomanChild12_1st    2      700 762
## - SibSp               1      699 763
## - Embarked            2      701 763
## - Parch               1      699 763
## - Fsize               4      706 764
## <none>                       698 764
## - Age                 1      705 769
## - FamilySize_dataSet  1      706 770
## - Pclass              2      774 836
## 
## Step:  AIC=745
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + WomanChild12_1st + 
##     WomanChild14_1st + Fsize + FamilySize_dataSet
## 
##                      Df Deviance AIC
## - WomanChild14_1st    2      711 741
## - SibSp               1      711 743
## - Embarked            2      713 743
## - Parch               1      712 744
## <none>                       711 745
## - WomanChild12_1st    2      715 745
## - Fsize               4      721 747
## - Age                 1      716 748
## - FamilySize_dataSet  1      718 750
## - Pclass              2      786 816
## 
## Step:  AIC=741
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + WomanChild12_1st + 
##     Fsize + FamilySize_dataSet
## 
##                      Df Deviance AIC
## - Embarked            2      713 739
## - SibSp               1      712 740
## - Parch               1      712 740
## <none>                       711 741
## - Fsize               4      721 743
## - Age                 1      716 744
## - FamilySize_dataSet  1      718 746
## - Pclass              2      787 813
## - WomanChild12_1st    2      970 996
## 
## Step:  AIC=739
## Survived ~ Pclass + Age + SibSp + Parch + WomanChild12_1st + 
##     Fsize + FamilySize_dataSet
## 
##                      Df Deviance  AIC
## - SibSp               1      714  738
## - Parch               1      714  738
## <none>                       713  739
## - Fsize               4      725  743
## - Age                 1      719  743
## - FamilySize_dataSet  1      721  745
## - Pclass              2      799  821
## - WomanChild12_1st    2      984 1006
## 
## Step:  AIC=738
## Survived ~ Pclass + Age + Parch + WomanChild12_1st + Fsize + 
##     FamilySize_dataSet
## 
##                      Df Deviance  AIC
## - Parch               1      715  737
## <none>                       714  738
## - Age                 1      719  741
## - Fsize               4      726  742
## - FamilySize_dataSet  1      721  743
## - Pclass              2      799  819
## - WomanChild12_1st    2      984 1004
## 
## Step:  AIC=737
## Survived ~ Pclass + Age + WomanChild12_1st + Fsize + FamilySize_dataSet
## 
##                      Df Deviance  AIC
## <none>                       715  737
## - Age                 1      720  740
## - Fsize               4      727  741
## - FamilySize_dataSet  1      722  742
## - Pclass              2      800  818
## - WomanChild12_1st    2      991 1009
#Best Model: Age + Pclass +  Fsize + FamilySize_dataSet + WomanChild12_1st
bothways = step(fit20.log_N, list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)),
                direction="both",trace=0)
summary(forwards)
## 
## Call:
## glm(formula = Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + 
##     Fsize + Age, family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.450  -0.550   0.399   0.514   2.760  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -4.43121    0.62221   -7.12  1.1e-12 ***
## WomanChild12_1stWomen  0.35656    0.49401    0.72   0.4704    
## WomanChild12_1stMen    3.54989    0.52805    6.72  1.8e-11 ***
## Pclass2nd Class        1.37139    0.28349    4.84  1.3e-06 ***
## Pclass3rd Class        2.29608    0.26415    8.69  < 2e-16 ***
## FamilySize_dataSet     0.46001    0.17729    2.59   0.0095 ** 
## Fsize2                -0.07101    0.26780   -0.27   0.7909    
## Fsize3                -0.36491    0.35246   -1.04   0.3005    
## Fsize4                -0.77762    0.64549   -1.20   0.2283    
## Fsize5+                1.30217    0.70647    1.84   0.0653 .  
## Age                    0.02054    0.00934    2.20   0.0278 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  714.66  on 880  degrees of freedom
## AIC: 736.7
## 
## Number of Fisher Scoring iterations: 5
dt.train$pred.forwards <- predict.glm(forwards, newdata = dt.train, type = "response")
dt.train$pred.forwards <- ifelse(dt.train$pred.forwards > 0.5,1,0)
dt.train$pred.forwards <- factor(dt.train$pred.forwards, levels = c(0,1),labels = c("Survived","Died"))
confusionMatrix(dt.train$pred.forwards, dt.train$Survived)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Survived Died
##   Survived      252   59
##   Died           90  490
##                                         
##                Accuracy : 0.833         
##                  95% CI : (0.807, 0.857)
##     No Information Rate : 0.616         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.64          
##  Mcnemar's Test P-Value : 0.014         
##                                         
##             Sensitivity : 0.737         
##             Specificity : 0.893         
##          Pos Pred Value : 0.810         
##          Neg Pred Value : 0.845         
##              Prevalence : 0.384         
##          Detection Rate : 0.283         
##    Detection Prevalence : 0.349         
##       Balanced Accuracy : 0.815         
##                                         
##        'Positive' Class : Survived      
## 

Titanic Test Dataset

dt.test <- readData(test.data,test.VariableType, missingNA)

dt.test$Pclass <- as.factor(dt.test$Pclass)
levels(dt.test$Pclass) <- c("1st Class", "2nd Class", "3rd Class")

dt.test$Sex <- factor(dt.test$Sex, levels=c("female", "male"))
levels(dt.test$Sex) <- c("Female", "Male")


# Graphs and tables from the testing dataset
mosaicplot(Pclass ~ Sex,
           data=dt.test, main="Titanic Test Data: Passengers by Class and Sex",
           color=c("#8dd3c7", "#fb8072"), shade=FALSE,  xlab="", ylab="",
           off=c(0), cex.axis=1.4)

which(is.na(dt.test$Fare))
## [1] 153
dt.test$Fare[153] <- median(dt.test$Fare, na.rm=TRUE) # impute the median Fare for the single missing value in the test dataset

# Grab title from passenger names
dt.test$Title <- gsub('(.*, )|(\\..*)', '', dt.test$Name)
table(dt.test$Title)
## 
##    Col   Dona     Dr Master   Miss     Mr    Mrs     Ms    Rev 
##      2      1      1     21     78    240     72      1      2
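The regex used above removes everything up to the comma-and-space and everything from the first period onward, leaving only the title; a toy check on a made-up name (hypothetical, not from the dataset):

```r
# Sketch: how the title regex behaves on an invented passenger name
name <- "Doe, Mr. John"
gsub('(.*, )|(\\..*)', '', name)  # "Mr"
```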
options(digits=2)
with(dt.test,bystats(Age, Title, 
                     fun=function(x)c(Mean=mean(x),Median=median(x))))
## 
##  Mean and Median of Age by Title 
## 
##          N Missing Mean Median
## Col      2       0 50.0     50
## Dona     1       0 39.0     39
## Dr       1       0 53.0     53
## Master  17       4  7.4      7
## Miss    64      14 21.8     22
## Mr     183      57 32.0     28
## Mrs     62      10 38.9     36
## Ms       0       1   NA     NA
## Rev      2       0 35.5     36
## ALL    332      86 30.3     27
summary(dt.test$Embarked)  # The variable Embarked has no missing values
##   C   Q   S 
## 102  46 270
## list of all titles 
titles <- c("Mr","Mrs","Miss","Master","Don","Rev",
            "Dr","Mme","Ms","Major","Lady","Sir",
            "Mlle","Col","Capt","the Countess","Jonkheer","Dona")

dt.test$Age <- imputeMedian(dt.test$Age,dt.test$Title,titles)
dt.test$Age[which(dt.test$Title=="Ms")] <- 36 # impute the median Mrs age for the single Ms passenger

q<-ggplot(dt.test, aes(x=Age, fill=Pclass)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Test Data: Age by Class")
q1<-q+scale_fill_manual(name="Class",values=c("blue","green", "pink"))
q2<-q1+scale_color_manual(values=c("blue","green", "pink"))
q2

## assigning a new title value to old title(s) 
dt.test$NewTitle[dt.test$Title %in% c("Col","Dr", "Rev")] <- 0 # Dr stays in the Special group because a Dr can be a woman
dt.test$NewTitle[dt.test$Title %in% c("Mrs", "Ms","Dona")] <- 1
dt.test$NewTitle[dt.test$Title %in% c("Master")] <- 2
dt.test$NewTitle[dt.test$Title %in% c("Miss", "Mlle")] <- 3
dt.test$NewTitle[dt.test$Title %in% c("Mr", "Sir", "Jonkheer")] <- 4
dt.test$NewTitle <- factor(dt.test$NewTitle)

dt.test$NewTitle <- as.factor(dt.test$NewTitle)
levels(dt.test$NewTitle) <- c("Special", "Mrs", "Master","Miss","Mr")
table(dt.test$NewTitle)
## 
## Special     Mrs  Master    Miss      Mr 
##       5      74      21      78     240
# Following the "women and children first" evacuation protocol,
# we create a variable that separates children, women, and men

# The variable WomanChild12_1st assumes a child is anyone aged 12 or younger
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Master")] <- 0
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age<=12] <- 0
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age>12] <- 1
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Mrs")] <- 1
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Female"] <- 1 # e.g. a female Dr
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Male"] <- 2 
dt.test$WomanChild12_1st[dt.test$NewTitle %in% c("Mr")] <- 2
dt.test$WomanChild12_1st <- as.factor(dt.test$WomanChild12_1st)
levels(dt.test$WomanChild12_1st) <- c("Children", "Women", "Men")

table(dt.test$WomanChild12_1st, dt.test$NewTitle)
##           
##            Special Mrs Master Miss  Mr
##   Children       0   0     21   12   0
##   Women          0  74      0   66   0
##   Men            5   0      0    0 240
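The chained assignments above can also be written as one vectorized expression; a minimal sketch (the helper name wcc12 is ours, not from the analysis), assuming the same NewTitle, Age, and Sex codings:

```r
# Sketch: the WomanChild12_1st coding as a single vectorized function
wcc12 <- function(NewTitle, Age, Sex) {
  code <- ifelse(NewTitle == "Master" | (NewTitle == "Miss" & Age <= 12), 0,   # Children
          ifelse(NewTitle == "Mrs" | (NewTitle == "Miss" & Age > 12) |
                 (NewTitle == "Special" & Sex == "Female"), 1,                 # Women
                 2))                                                           # Men
  factor(code, levels = 0:2, labels = c("Children", "Women", "Men"))
}
wcc12(c("Miss", "Mr", "Special"), c(10, 40, 50), c("Female", "Male", "Female"))
# Children, Men, Women
```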
# The variable WomanChild14_1st assumes a child is anyone aged 14 or younger
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Master")] <-0
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age<=14] <- 0
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Miss") & dt.test$Age>14] <- 1
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Mrs")] <- 1
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Female"] <- 1 # e.g. a female Dr
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Special") & dt.test$Sex=="Male"] <- 2 
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Mr") & dt.test$Age<=14] <- 0
dt.test$WomanChild14_1st[dt.test$NewTitle %in% c("Mr") & dt.test$Age>14] <- 2
dt.test$WomanChild14_1st <- as.factor(dt.test$WomanChild14_1st)
levels(dt.test$WomanChild14_1st) <- c("Children", "Women", "Men")

table(dt.test$WomanChild14_1st, dt.test$NewTitle)
##           
##            Special Mrs Master Miss  Mr
##   Children       0   0     21   12   2
##   Women          0  74      0   66   0
##   Men            5   0      0    0 238
q<-ggplot(dt.test, aes(x=Age, fill=WomanChild12_1st)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Test Data: Women and Children First coding")
q1<-q+scale_fill_manual(name="Women & Children (< 13 years)\nFirst",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
q2

q<-ggplot(dt.test, aes(x=Age, fill=WomanChild14_1st)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Test Data: Women and Children First coding")
q1<-q+scale_fill_manual(name="Women & Children (< 15 years)\nFirst",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
q2

dt.test$FamilySize <- dt.test$SibSp + dt.test$Parch + 1 # passenger + siblings/spouse +
# parents/children aboard

# Apply the same FamilySize binning used for the training data, where we saw:
# FamilySize = 1: passengers travelling alone are more likely to die
# FamilySize = 2, 3 or 4: passengers with 1 to 3 family members are more likely to survive
# FamilySize = 5 or more: passengers with a family size of 5 or more are more likely to die
dt.test$Fsize[dt.test$FamilySize == 1] <- 1
dt.test$Fsize[dt.test$FamilySize == 2] <- 2
dt.test$Fsize[dt.test$FamilySize == 3] <- 3
dt.test$Fsize[dt.test$FamilySize == 4] <- 4
dt.test$Fsize[dt.test$FamilySize >= 5] <- 5 
dt.test$Fsize <- as.factor(dt.test$Fsize)

levels(dt.test$Fsize) <- c("1", "2", "3","4","5+")
table(dt.test$Fsize)
## 
##   1   2   3   4  5+ 
## 253  74  57  14  20
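The five Fsize assignments can equivalently be collapsed into a single cut() call; a small sketch on made-up family sizes:

```r
# Sketch: binning family size with cut() instead of five separate assignments
fam <- c(1, 2, 3, 4, 5, 7, 11)
cut(fam, breaks = c(0, 1, 2, 3, 4, Inf),
    labels = c("1", "2", "3", "4", "5+"))
# 1, 2, 3, 4, 5+, 5+, 5+
```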
with(dt.test,table(Fsize, Sex))
##      Sex
## Fsize Female Male
##    1      68  185
##    2      36   38
##    3      30   27
##    4      10    4
##    5+      8   12
par(mfrow=c(1,1))
boxplot(Age ~ FamilySize, data =dt.test, xlab="Family Size on the Ship", 
        ylab="Age (years)", main="Titanic Test Data",col=c(2:8,"pink","orange"))

#Family Name
dt.test$FamilyName <- gsub(",.*$", "", dt.test$Name)

# To create a FamilyID, paste the family size aboard the Titanic to the
# passenger's surname
dt.test$FamilyID <- paste(as.character(dt.test$FamilySize), dt.test$FamilyName, sep="")
dt.test$FamilyID_Embk_Ticket <- paste(dt.test$FamilyID,dt.test$Embarked, as.character(dt.test$Ticket), sep="_")
dt.test$FamilyID_dataSet <- match(dt.test$FamilyID_Embk_Ticket, unique(dt.test$FamilyID_Embk_Ticket))
dt.test$FamilySize_dataSet <- ave(dt.test$FamilyID_dataSet,dt.test$FamilyID_dataSet, FUN =length)

summary(dt.test$FamilySize_dataSet)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0     1.2     1.0     4.0
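The match()/ave() pair above first gives every distinct FamilyID_Embk_Ticket key an integer ID and then counts how many passengers share each ID; a toy illustration with invented keys:

```r
# Sketch: how match() + ave() count family members actually found in the dataset
ids <- c("3Smith_S_111", "3Smith_S_111", "1Jones_C_222")  # invented keys
fid <- match(ids, unique(ids))   # 1 1 2  (family ID within the dataset)
ave(fid, fid, FUN = length)      # 2 2 1  (family size within the dataset)
```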
plot(dt.test$FamilyID_dataSet, dt.test$FamilySize, xlab="Family ID in the dataset",
     ylab="Family Size on the Ship",main= "Titanic Test dataset")

plot(dt.test$FamilySize_dataSet,dt.test$FamilySize, xlab="Family Size in the dataset",
     ylab="Family Size on the Ship",main= "Titanic Test dataset")

table(factor(dt.test$FamilySize),factor(dt.test$FamilySize_dataSet))
##     
##        1   2   3   4
##   1  253   0   0   0
##   2   54  20   0   0
##   3   32  22   3   0
##   4    6   8   0   0
##   5    4   0   3   0
##   6    1   2   0   0
##   7    1   0   3   0
##   8    0   2   0   0
##   11   0   0   0   4
#Fare
q<-ggplot(dt.test, aes(x=Fare, fill=Pclass)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Test Data: Fare by Class")
q1<-q+scale_fill_manual(name="Class",values=c("green","blue", "pink"))
q2<-q1+scale_color_manual(values=c("green","blue", "pink"))
q2

with(dt.test, {
        boxplot(Fare ~ FamilySize, xlab="Family Size on the Titanic", 
                ylab="Fare", main="Titanic Test Data", col=c(2:8,"pink","orange"))
})

par(mfrow=c(1,2))
with(dt.test, {
        boxplot(Fare ~ Fsize, xlab="Family Size on the Titanic", 
                ylab="Fare", main="Titanic Test Data", col=2:10)
        boxplot(Fare ~ Fsize, xlab="Family Size on the Titanic", 
                ylab="Fare", main="Titanic Test Data", col=2:10, ylim=c(0,250))
        
})

q<-ggplot(dt.test, aes(x=Fare, fill=FamilySize)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Test Data: Fare by Family Size")

q<-ggplot(dt.test, aes(x=Fare, fill=Fsize)) +
        geom_histogram(position="identity", alpha=0.5,bins=90)  +
        labs(title="Titanic Test Data: Fare by Family Size")

set.seed(12345)

Model 19: Survived ~ Pclass + Fsize + WomanChild12_1st

Logistic (Accuracy = 0.833 and Kaggle Score = 0.78947)

#fit19.log <- glm(Survived ~ Pclass + Fsize + WomanChild12_1st , family = binomial(link='logit'), data = dt.train)
dt.test$pred.fit19.log <- predict.glm(fit19.log, newdata = dt.test, type = "response")
dt.test$pred.fit19.log <- ifelse(dt.test$pred.fit19.log > 0.5,0,1) # the model predicts P(died), so > 0.5 maps to Survived = 0

#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit19.log)
write.csv(submit, file = "Prediction_model19_logistic.csv", row.names = FALSE)
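Because the code above maps probabilities above 0.5 to Survived = 0 (i.e. the fitted event is death), the probability-to-label step is inverted relative to Kaggle's coding of 1 = survived. A small helper sketch (the name to_submission is ours, not from the analysis) that makes the mapping explicit:

```r
# Sketch: turn P(died) from the fitted models into Kaggle's Survived column,
# using the same 0.5 cutoff as above (1 = survived, 0 = died)
to_submission <- function(p_died, threshold = 0.5) {
  ifelse(p_died > threshold, 0L, 1L)
}
to_submission(c(0.9, 0.2, 0.5))  # 0 1 1
```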

Model 13: Survived ~ Sex + Age + Pclass + Fsize + NewTitle

Logistic (Accuracy = 0.833 and Kaggle Score = 0.78469)

fit13.log <- glm(Survived ~ Sex + Age + Pclass + Fsize + NewTitle, family = binomial(link='logit'), data = dt.train)
dt.test$pred.fit13.log <- predict.glm(fit13.log, newdata = dt.test, type = "response")
dt.test$pred.fit13.log <- ifelse(dt.test$pred.fit13.log > 0.5,0,1) # the model predicts P(died), so > 0.5 maps to Survived = 0

#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit13.log)
write.csv(submit, file = "Prediction_model13_logistic.csv", row.names = FALSE)

Model 20: Survived ~ complete

Logistic (Accuracy = 0.833 and Kaggle Score = 0.77990)

Using stepwise selection, the best model contains the same variables whether chosen forwards, backwards, or both ways.

fit20.log_C <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Fare+
                           Title + NewTitle + WomanChild12_1st +WomanChild14_1st+
                           Fsize + FamilySize_dataSet , family = binomial(link='logit'), data = dt.train)

fit20.log_N <- glm(Survived ~ 1 , family = binomial(link='logit'), data = dt.train)

#Best Model: Age + Pclass + Fsize + FamilySize_dataSet + WomanChild12_1st
forwards = step(fit20.log_N,scope=list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)), direction="forward")
## Start:  AIC=1189
## Survived ~ 1
## 
##                      Df Deviance  AIC
## + WomanChild12_1st    2      882  888
## + WomanChild14_1st    2      887  893
## + NewTitle            4      883  893
## + Title              16      869  903
## + Sex                 1      918  922
## + Pclass              2     1083 1089
## + Fsize               4     1108 1118
## + Fare                1     1118 1122
## + Embarked            2     1161 1167
## + Parch               1     1181 1185
## + Age                 1     1181 1185
## <none>                      1187 1189
## + FamilySize_dataSet  1     1185 1189
## + SibSp               1     1186 1190
## 
## Step:  AIC=888
## Survived ~ WomanChild12_1st
## 
##                      Df Deviance AIC
## + Pclass              2      779 789
## + Fsize               4      810 824
## + FamilySize_dataSet  1      826 834
## + SibSp               1      845 853
## + Fare                1      852 860
## + Embarked            2      860 870
## + Parch               1      869 877
## + Age                 1      880 888
## <none>                       882 888
## + Sex                 1      882 890
## + WomanChild14_1st    2      881 891
## + NewTitle            4      879 893
## + Title              16      866 904
## 
## Step:  AIC=789
## Survived ~ WomanChild12_1st + Pclass
## 
##                      Df Deviance AIC
## + FamilySize_dataSet  1      732 744
## + Fsize               4      727 745
## + SibSp               1      748 760
## + Parch               1      767 779
## + Age                 1      773 785
## + Embarked            2      771 785
## <none>                       779 789
## + Sex                 1      779 791
## + Fare                1      779 791
## + WomanChild14_1st    2      779 793
## + NewTitle            4      778 796
## + Title              16      770 812
## 
## Step:  AIC=744
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet
## 
##                    Df Deviance AIC
## + Fsize             4      720 740
## + Age               1      727 741
## + Fare              1      729 743
## + Embarked          2      728 744
## <none>                     732 744
## + SibSp             1      731 745
## + Parch             1      731 745
## + Sex               1      732 746
## + WomanChild14_1st  2      732 748
## + NewTitle          4      728 748
## + Title            16      721 765
## 
## Step:  AIC=740
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize
## 
##                    Df Deviance AIC
## + Age               1      715 737
## + Fare              1      716 738
## <none>                     720 740
## + Embarked          2      717 741
## + Parch             1      719 741
## + Sex               1      719 741
## + SibSp             1      719 741
## + WomanChild14_1st  2      719 743
## + NewTitle          4      717 745
## + Title            16      710 762
## 
## Step:  AIC=737
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize + 
##     Age
## 
##                    Df Deviance AIC
## + Fare              1      712 736
## <none>                     715 737
## + Parch             1      714 738
## + Sex               1      714 738
## + SibSp             1      714 738
## + Embarked          2      712 738
## + NewTitle          4      710 740
## + WomanChild14_1st  2      714 740
## + Title            16      702 756
## 
## Step:  AIC=736
## Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + Fsize + 
##     Age + Fare
## 
##                    Df Deviance AIC
## <none>                     712 736
## + Parch             1      711 737
## + Sex               1      711 737
## + SibSp             1      712 738
## + Embarked          2      710 738
## + NewTitle          4      706 738
## + WomanChild14_1st  2      711 739
## + Title            16      699 755
#Best Model: Age + Pclass +  Fsize + FamilySize_dataSet + WomanChild12_1st
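For context, the forward trace above can be reproduced with a call of this shape (a sketch, consistent with the backward call that follows; `fit20.log_N` and `fit20.log_C` are the base and full logistic models fitted earlier):

```r
# Forward selection: start from the base model and add terms from the full
# model's scope while AIC improves (sketch of the call behind the trace above)
forwards <- step(fit20.log_N,
                 scope = list(lower = formula(fit20.log_N),
                              upper = formula(fit20.log_C)),
                 direction = "forward")
```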
backwards = step(fit20.log_C) # Backwards selection is the default
## Start:  AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Fare + 
##     Title + NewTitle + WomanChild12_1st + WomanChild14_1st + 
##     Fsize + FamilySize_dataSet
## 
## 
## Step:  AIC=764
## Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + Fare + 
##     Title + WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
## 
## 
## Step:  AIC=764
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Fare + Title + 
##     WomanChild12_1st + WomanChild14_1st + Fsize + FamilySize_dataSet
## 
##                      Df Deviance AIC
## - Title              16      709 745
## - WomanChild14_1st    2      696 760
## - Embarked            2      698 762
## - SibSp               1      696 762
## - WomanChild12_1st    2      698 762
## - Parch               1      696 762
## <none>                       696 764
## - Fsize               4      704 764
## - Fare                1      698 764
## - Age                 1      702 768
## - FamilySize_dataSet  1      703 769
## - Pclass              2      736 800
## 
## Step:  AIC=745
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Fare + WomanChild12_1st + 
##     WomanChild14_1st + Fsize + FamilySize_dataSet
## 
##                      Df Deviance AIC
## - WomanChild14_1st    2      709 741
## - Embarked            2      711 743
## - SibSp               1      709 743
## - Parch               1      710 744
## <none>                       709 745
## - Fare                1      711 745
## - WomanChild12_1st    2      713 745
## - Age                 1      713 747
## - Fsize               4      719 747
## - FamilySize_dataSet  1      716 750
## - Pclass              2      750 782
## 
## Step:  AIC=741
## Survived ~ Pclass + Age + SibSp + Parch + Embarked + Fare + WomanChild12_1st + 
##     Fsize + FamilySize_dataSet
## 
##                      Df Deviance AIC
## - Embarked            2      711 739
## - SibSp               1      710 740
## - Parch               1      710 740
## <none>                       709 741
## - Fare                1      711 741
## - Age                 1      713 743
## - Fsize               4      719 743
## - FamilySize_dataSet  1      716 746
## - Pclass              2      751 779
## - WomanChild12_1st    2      966 994
## 
## Step:  AIC=739
## Survived ~ Pclass + Age + SibSp + Parch + Fare + WomanChild12_1st + 
##     Fsize + FamilySize_dataSet
## 
##                      Df Deviance  AIC
## - SibSp               1      711  737
## - Parch               1      712  738
## <none>                       711  739
## - Fare                1      713  739
## - Age                 1      715  741
## - Fsize               4      722  742
## - FamilySize_dataSet  1      718  744
## - Pclass              2      753  777
## - WomanChild12_1st    2      979 1003
## 
## Step:  AIC=737
## Survived ~ Pclass + Age + Parch + Fare + WomanChild12_1st + Fsize + 
##     FamilySize_dataSet
## 
##                      Df Deviance  AIC
## - Parch               1      712  736
## <none>                       711  737
## - Fare                1      714  738
## - Age                 1      715  739
## - Fsize               4      723  741
## - FamilySize_dataSet  1      718  742
## - Pclass              2      753  775
## - WomanChild12_1st    2      979 1001
## 
## Step:  AIC=736
## Survived ~ Pclass + Age + Fare + WomanChild12_1st + Fsize + FamilySize_dataSet
## 
##                      Df Deviance  AIC
## <none>                       712  736
## - Fare                1      715  737
## - Age                 1      716  738
## - Fsize               4      724  740
## - FamilySize_dataSet  1      719  741
## - Pclass              2      754  774
## - WomanChild12_1st    2      985 1005
#Best Model: Age + Pclass +  Fsize + FamilySize_dataSet + WomanChild12_1st
bothways = step(fit20.log_N, list(lower=formula(fit20.log_N),upper=formula(fit20.log_C)),
                direction="both",trace=0)
summary(forwards)
## 
## Call:
## glm(formula = Survived ~ WomanChild12_1st + Pclass + FamilySize_dataSet + 
##     Fsize + Age + Fare, family = binomial(link = "logit"), data = dt.train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.366  -0.535   0.404   0.512   2.863  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -4.21968    0.64107   -6.58  4.6e-11 ***
## WomanChild12_1stWomen  0.48603    0.50674    0.96  0.33749    
## WomanChild12_1stMen    3.67084    0.53981    6.80  1.0e-11 ***
## Pclass2nd Class        1.14081    0.31682    3.60  0.00032 ***
## Pclass3rd Class        2.02139    0.31345    6.45  1.1e-10 ***
## FamilySize_dataSet     0.46468    0.17932    2.59  0.00956 ** 
## Fsize2                 0.00299    0.27355    0.01  0.99129    
## Fsize3                -0.26150    0.36034   -0.73  0.46802    
## Fsize4                -0.65904    0.65460   -1.01  0.31404    
## Fsize5+                1.53487    0.73665    2.08  0.03720 *  
## Age                    0.01853    0.00943    1.96  0.04946 *  
## Fare                  -0.00418    0.00267   -1.56  0.11767    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  711.77  on 879  degrees of freedom
## AIC: 735.8
## 
## Number of Fisher Scoring iterations: 5
dt.test$pred.forwards <- predict.glm(forwards, newdata = dt.test, type = "response")
# The positive coefficients for the male and 3rd-class levels show the modeled
# event is death, so probabilities above 0.5 map to 0 (died), not 1
dt.test$pred.forwards <- ifelse(dt.test$pred.forwards > 0.5, 0, 1)

#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.forwards)
write.csv(submit, file = "Prediction_model21_StepF_logistic.csv", row.names = FALSE)

Model 9: Survived ~ Sex + Age + Pclass + Fsize

Decision Tree (Accuracy = 0.841 and Kaggle Score = 0.79426)

fit9.dt <- rpart(Survived ~ Sex + Age + Pclass + Fsize, data=dt.train, method="class")
dt.test$pred.fit9.dt <- predict(fit9.dt, newdata=dt.test, type='class')
# Recode the factor prediction ("Died"/"Survived") to 0/1; assigning numbers
# into the factor directly triggers "invalid factor level" warnings and NAs
dt.test$pred.fit9.dt <- ifelse(dt.test$pred.fit9.dt == "Survived", 1, 0)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit9.dt)
write.csv(submit, file = "Prediction_model9_DecisionTree.csv", row.names = FALSE)
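Before submitting, the fitted tree itself can be inspected with base rpart plotting (a quick sketch, not part of the original submission):

```r
# Visualize the tree's splits; use.n adds the Died/Survived counts at each leaf
plot(fit9.dt, uniform = TRUE, margin = 0.1)
text(fit9.dt, use.n = TRUE, cex = 0.8)
```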

Model 6: Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked

Random Forest (Accuracy = 0.869 and Kaggle Score = 0.79426)

fit6.rf <- randomForest(Survived ~ Sex + Age + Pclass + SibSp + Parch + Embarked,
                        data=dt.train, importance=TRUE, ntree=2000)
# type='response' returns the predicted class for a classification forest
dt.test$pred.fit6.rf <- predict(fit6.rf, newdata=dt.test, type='response')
dt.test$pred.fit6.rf <- ifelse(dt.test$pred.fit6.rf == "Survived", 1, 0)
#Submitting
submit <- data.frame(PassengerId = dt.test$PassengerId, Survived = dt.test$pred.fit6.rf)
write.csv(submit, file = "Prediction_model6_RandomForest.csv", row.names = FALSE)
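Since the forest was grown with `importance=TRUE`, the relative contribution of each predictor can be examined directly (a quick check, not in the original analysis):

```r
# Mean decrease in accuracy and in Gini impurity for each predictor
importance(fit6.rf)
varImpPlot(fit6.rf, main = "Variable importance, fit6.rf")
```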